JSC-BIO-2710

Data Intensive Computing for Applied Bioinformatics

Table Of Contents

Previous topic

Week Ten 20-Mar-12

This Page

Week Eleven 27-Mar-12

Tuesday

Note

Learning Goals:

  • Create complete python program to parse BLAST with taxonomy
  • Reformat BLAST output to include percent identity
  • Modify python program to filter hits based on % identity
  • Output top hits based on different hit criteria




 
 
 
 
 
 
 
 

Video



Lecture

Full python program to count hits and fetch taxonomy:

#!/usr/bin/env python

"""

James Vincent
Mar 27 , 2012

parseBlast.py
Version 2.0

Open a BLAST text file
loop over lines
keep count of hits per GI
split lines into fields
Extract GI number from each line
Convert GI to taxon ID number
convert taxon ID to genus and species text strings

Output: top five hits by count of GI with genus and species for each
"""

import sys
from collections import Counter
from Bio import Entrez


topHits = 10                  # Number of top hits to output
Entrez.email = "jim@uvm.edu"  # Always tell NCBI who you are

def getTaxonID(giNumber):
   """ return taxon id from entrez given input gi number """

   myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
   myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]

   return myTaxonID


def getTaxonGenusSpecies(taxonID):
   """ return genus as string from Entrez given input taxon ID number """

   handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
   myGenus = handle[0]["Genus"]
   mySpecies = handle[0]["Species"]

   return myGenus,mySpecies


def getScore(blastLine):
   """ parse blast output line and return score """

   # BLAST input file has hit lines like this:
   # fmt "6 qseqid sseqid evalue bitscore"
   # 1        gi|219856848|ref|NR_024667.1|   0.0     2551
   myFields = blastLine.strip().split()

   thisScore  = float(myFields[3])

   return thisScore


def getGI(blastLine):
   """ parse blast output line and return GI of hit """

   # blast line looks like this:
   # 1        gi|219856848|ref|NR_024667.1|   0.0     2551

   # spit line into fields by white space
   myFields = blastLine.strip().split()

   # split subject (hit) ID field on |
   myIDs = myFields[1].split("|")
   thisGI= myIDs[1]

   return thisGI


# Get input file name
myInfileName = sys.argv[1]
infile = open(myInfileName)

# Counter for hit GIs
myGICounter = Counter()

for thisLine in infile.readlines():
   # Extract score
   thisScore = getScore(thisLine)
   # Extract GI
   thisGI  = getGI(thisLine)
   # Count this GI hit
   myGICounter[thisGI] += 1


# Print top five hits
# print myGICounter.most_common(5)
for thisItem in myGICounter.most_common(topHits):
   thisGI = thisItem[0]
   thisCount = thisItem[1]

   # Get taxon ID
   taxonID = getTaxonID(thisGI)
   # Get genus, species
   genus,species =  getTaxonGenusSpecies(taxonID)
   print thisGI,thisCount,taxonID,genus,species



Reformat BLAST output to include percent identity:

$BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid qlen length pident evalue bitscore " -out $OUTFILE.table.pident


Add functions to parse percent idenity and hit length:

#!/usr/bin/env python

"""

James Vincent
Mar 27 , 2012

parseBlast.py
Version 3.0

Open a BLAST text file
loop over lines
keep count of hits per GI
split lines into fields
Extract GI number from each line
Extract hit length and % idenity
Only count hits that meet criteria for length and % identity

Convert GI to taxon ID number
convert taxon ID to genus and species text strings

Output: top five hits by count of GI with genus and species for each
"""

import sys
from collections import Counter
from Bio import Entrez


topHits = 10                  # Number of top hits to output
minHitIdentity = 100.0         # Min percent idenity of hit to keep

Entrez.email = "jim@uvm.edu"  # Always tell NCBI who you are

def getTaxonID(giNumber):
   """ return taxon id from entrez given input gi number """

   myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
   myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]

   return myTaxonID


def getTaxonGenusSpecies(taxonID):
   """ return genus as string from Entrez given input taxon ID number """

   handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
   myGenus = handle[0]["Genus"]
   mySpecies = handle[0]["Species"]

   return myGenus,mySpecies

#  -- BLAST parsing functions assume
#     blast table format lines look like this
# qseqid sseqid qlen length pident evalue bitscore
#1   gi|343198516|ref|NR_042772.1|   100     100     100.00  1e-47    185
#1   gi|219878377|ref|NR_025516.1|   100     90      96.67   4e-37    150

def getQueryLength(blastLine):
   """ return length of query from blast line """

   myFields = blastLine.strip().split()

   thisQueryLength  = int(myFields[2])

   return thisQueryLength


def getHitLength(blastLine):
   """ return length of hit from blast line """

   myFields = blastLine.strip().split()

   thisHitLength  = int(myFields[3])

   return thisHitLength


def getIdent(blastLine):
   """ parse blast output line and return percent identity of alignment """

   myFields = blastLine.strip().split()

   thisIdent  = float(myFields[4])

   return thisIdent


def getGI(blastLine):
   """ parse blast output line and return GI of hit """

   myFields = blastLine.strip().split()

   # split subject (hit) ID field on |
   myIDs = myFields[1].split("|")
   thisGI= myIDs[1]

   return thisGI


# Get input file name
myInfileName = sys.argv[1]
infile = open(myInfileName)

# Counter for hit GIs

myGICounter = Counter()

for thisLine in infile.readlines():

   # Extract GI
   thisGI  = getGI(thisLine)

   # Extract hit  and query length
   thisHitLength  = getHitLength(thisLine)
   thisQueryLength  = getQueryLength(thisLine)

   # Extract hit percent identity
   thisIdent  = getIdent(thisLine)

   # Count this GI hit based on conditions
   if (thisHitLength == thisQueryLength)  and (thisIdent >= minHitIdentity):
      myGICounter[thisGI] += 1



# Print top five hits
# print myGICounter.most_common(5)
for thisItem in myGICounter.most_common(topHits):
   thisGI = thisItem[0]
   thisCount = thisItem[1]

   # Get taxon ID
   taxonID = getTaxonID(thisGI)
   # Get genus, species
   genus,species =  getTaxonGenusSpecies(taxonID)
   print thisGI,thisCount,taxonID,genus,species



Homework

Reading

Review programs in lecture materials for this week and last

Exercises

Make sure you can write the programs shown in lecture

Turn In

Write a python program that parses BLAST table formatted output:

  • Use week9.fa BLAST results

  • Format BLAST results in table format and include percent identity

  • Count number of hits at 100% identity and same length as query for each query ID (not GI)

  • Write output that contains two columns: QUERY ID and number of hits:

    # Number of hits at 100% identity, full length of query
    # QUERY ID  Number Hits
    1        23
    2        21
    3        20
    4        14
    5        12

Thursday




Note

Learning Goals:

  • Create complete python program to parse BLAST table output and explore hit counts
  • Output hit table with per query hits and %
  • Output top hits based on different hit criteria


Video



Lecture

Full python program to count hits at 100% length and 100% identity per query:

#!/usr/bin/env python

"""

James Vincent
Mar 29 , 2012

parseBlastQueryHits.py
Version 1.0

Open a BLAST text file
loop over lines
split lines into fields
keep count of hits per query that are 100% length of query
and 100% identity
Output two column table with query ID, number of hits

"""

import sys
from collections import Counter
from Bio import Entrez


topHits = 10                  # Number of top hits to output
minHitIdentity = 100.0         # Min percent idenity of hit to keep

Entrez.email = "jim@uvm.edu"  # Always tell NCBI who you are

def getTaxonID(giNumber):
   """ return taxon id from entrez given input gi number """

   myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
   myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]

   return myTaxonID


def getTaxonGenusSpecies(taxonID):
   """ return genus as string from Entrez given input taxon ID number """

   handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
   myGenus = handle[0]["Genus"]
   mySpecies = handle[0]["Species"]

   return myGenus,mySpecies

#  -- BLAST parsing functions assume
#     blast table format lines look like this
# qseqid sseqid qlen length pident evalue bitscore
#1   gi|343198516|ref|NR_042772.1|   100     100     100.00  1e-47    185
#1   gi|219878377|ref|NR_025516.1|   100     90      96.67   4e-37    150

def getQueryLength(blastLine):
   """ return length of query from blast line """

   myFields = blastLine.strip().split()

   thisQueryLength  = int(myFields[2])

   return thisQueryLength


def getHitLength(blastLine):
   """ return length of hit from blast line """

   myFields = blastLine.strip().split()

   thisHitLength  = int(myFields[3])

   return thisHitLength


def getIdent(blastLine):
   """ parse blast output line and return percent identity of alignment """

   myFields = blastLine.strip().split()

   thisIdent  = float(myFields[4])

   return thisIdent

def getGI(blastLine):
   """ parse blast output line and return GI of hit """

   myFields = blastLine.strip().split()

   # split subject (hit) ID field on |
   myIDs = myFields[1].split("|")
   thisGI= myIDs[1]

   return thisGI


def getQueryID(blastLine):
   """ parse blast output line and return query ID of hit """

   myFields = blastLine.strip().split()

   # split subject (hit) ID field on |
   myQueryID = myFields[0]

   return myQueryID


# Get input file name
myInfileName = sys.argv[1]
infile = open(myInfileName)

# Counter for hit GIs

#myGICounter = Counter()
myQueryCounter = Counter()

for thisLine in infile.readlines():

   # Extract GI
   # thisGI  = getGI(thisLine)

   # Extract query ID
   thisQueryID  = str(getQueryID(thisLine))

   # Extract hit  and query length
   thisHitLength  = getHitLength(thisLine)
   thisQueryLength  = getQueryLength(thisLine)

   # Extract hit percent identity
   thisIdent  = getIdent(thisLine)

   # Count this GI hit based on conditions
   if (thisHitLength == thisQueryLength)  and (thisIdent >= minHitIdentity):
      myQueryCounter[thisQueryID] += 1


for thisKey in myQueryCounter.keys():
   print thisKey,'\t',myQueryCounter[thisKey]

Submit multiple jobs to queue on lonestar:

#!/bin/bash

#$ -V                # inherit shell environment
#$ -l h_rt=00:05:00  # wall time limit
#$ -q development    # run in development queue - 1 hour max
#$ -pe 1way 12       # Number of nodes and cores
#$ -A TG-MCB110143   # TACC account

#$ -N Week11_Thurs   # Name of job and stdout files
#$ -cwd              # Run job from the directory it was submitted
#$ -j y

#------------------------
#
# James Vincent
# March 29, 2012
#
# Run blast on week9.fa vs 16SMicrobial database
# Contains 100 reads from 16S of s_mutans and 100 reads from m.komossens
# randomly distributed over 16S rRNA, 100bp length reads
# Reformat output to include Query seq-id, subject seq-id, score and e-value
#
#------------------------


# BLAST programs and variables

# TACC lonestar uses module system to provide blast
module load blast
module load python


# Database
DB=$WORK/JSCBIO2710/blast/databases/16SMicrobial

# Query
QUERY=week9.fa
OUTFILE=$QUERY.blast.asn

# BLAST output format: 11 is ASN, 6 is table no header
OUTFMT=11

# BLAST programs loaded by module command
BLASTN=blastn
BLAST_FORMATTER=blast_formatter
BLASTDBCMD=blastdbcmd

# ---!! Beware comments followed by $:  #$ will be interpeted by qsub
# ---   add extra space:  #  $

# Run blast
$BLASTN -db $DB -query $QUERY -outfmt $OUTFMT -out $OUTFILE

# Reformat ASN to hit custom hit table
$BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid qlen length pident evalue bitscore " -out $OUTFILE.table.pident

# Parse BLAST output with python program to get best hits
.parseBlast_v3.py  $OUTFILE.table.pident

Rename jobs in queue script to keep track:

This line in qsub script:

#$ -N Week11_Thurs   # Name of job and stdout files

is the name used to create the job output files:

Week11_Thurs.o522740
Week11_Thurs.po522740


Change the job name to clearly describe different jobs:

#$ -N Week11_Control1   # Name of job and stdout files

#$ -N Week11_Control2   # Name of job and stdout files

#$ -N Week11_Sample1    # Name of job and stdout files

Homework

Reading Exercises

Write the programs from this week (given above) on your own. Make sure you can write small functions and complete programs

Turn In:

Run BLAST jobs on the three FASTA files given below.

Copy the fasta files to your own working directory.

Modify your job script to keep track of which job you are running.

Convert the BLAST output to table format and inclue percent identity.

Parse the BLAST table output with your program (example given in Tues above).
to produce one line of output for each hit GI : GI, Count of hits, taxonID, genus, species


Turn in the resulting table for each of three FASTA files:

/work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control1.fa
/work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control2.fa
/work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control3.fa