Note
Learning Goals:
Full python program to count hits and fetch taxonomy:
#!/usr/bin/env python
"""
James Vincent
Mar 27 , 2012
parseBlast.py
Version 2.0
Open a BLAST text file
loop over lines
keep count of hits per GI
split lines into fields
Extract GI number from each line
Convert GI to taxon ID number
convert taxon ID to genus and species text strings
Output: top five hits by count of GI with genus and species for each
"""
import sys
from collections import Counter
from Bio import Entrez
topHits = 10 # Number of top hits to output
Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are
def getTaxonID(giNumber):
""" return taxon id from entrez given input gi number """
myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]
return myTaxonID
def getTaxonGenusSpecies(taxonID):
""" return genus as string from Entrez given input taxon ID number """
handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
myGenus = handle[0]["Genus"]
mySpecies = handle[0]["Species"]
return myGenus,mySpecies
def getScore(blastLine):
""" parse blast output line and return score """
# BLAST input file has hit lines like this:
# fmt "6 qseqid sseqid evalue bitscore"
# 1 gi|219856848|ref|NR_024667.1| 0.0 2551
myFields = blastLine.strip().split()
thisScore = float(myFields[3])
return thisScore
def getGI(blastLine):
""" parse blast output line and return GI of hit """
# blast line looks like this:
# 1 gi|219856848|ref|NR_024667.1| 0.0 2551
# spit line into fields by white space
myFields = blastLine.strip().split()
# split subject (hit) ID field on |
myIDs = myFields[1].split("|")
thisGI= myIDs[1]
return thisGI
# Get input file name
myInfileName = sys.argv[1]
infile = open(myInfileName)
# Counter for hit GIs
myGICounter = Counter()
for thisLine in infile.readlines():
# Extract score
thisScore = getScore(thisLine)
# Extract GI
thisGI = getGI(thisLine)
# Count this GI hit
myGICounter[thisGI] += 1
# Print top five hits
# print myGICounter.most_common(5)
for thisItem in myGICounter.most_common(topHits):
thisGI = thisItem[0]
thisCount = thisItem[1]
# Get taxon ID
taxonID = getTaxonID(thisGI)
# Get genus, species
genus,species = getTaxonGenusSpecies(taxonID)
print thisGI,thisCount,taxonID,genus,species
Reformat BLAST output to include percent identity:
$BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid qlen length pident evalue bitscore " -out $OUTFILE.table.pident
Add functions to parse percent idenity and hit length:
#!/usr/bin/env python
"""
James Vincent
Mar 27 , 2012
parseBlast.py
Version 3.0
Open a BLAST text file
loop over lines
keep count of hits per GI
split lines into fields
Extract GI number from each line
Extract hit length and % idenity
Only count hits that meet criteria for length and % identity
Convert GI to taxon ID number
convert taxon ID to genus and species text strings
Output: top five hits by count of GI with genus and species for each
"""
import sys
from collections import Counter
from Bio import Entrez
topHits = 10 # Number of top hits to output
minHitIdentity = 100.0 # Min percent idenity of hit to keep
Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are
def getTaxonID(giNumber):
""" return taxon id from entrez given input gi number """
myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]
return myTaxonID
def getTaxonGenusSpecies(taxonID):
""" return genus as string from Entrez given input taxon ID number """
handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
myGenus = handle[0]["Genus"]
mySpecies = handle[0]["Species"]
return myGenus,mySpecies
# -- BLAST parsing functions assume
# blast table format lines look like this
# qseqid sseqid qlen length pident evalue bitscore
#1 gi|343198516|ref|NR_042772.1| 100 100 100.00 1e-47 185
#1 gi|219878377|ref|NR_025516.1| 100 90 96.67 4e-37 150
def getQueryLength(blastLine):
""" return length of query from blast line """
myFields = blastLine.strip().split()
thisQueryLength = int(myFields[2])
return thisQueryLength
def getHitLength(blastLine):
""" return length of hit from blast line """
myFields = blastLine.strip().split()
thisHitLength = int(myFields[3])
return thisHitLength
def getIdent(blastLine):
""" parse blast output line and return percent identity of alignment """
myFields = blastLine.strip().split()
thisIdent = float(myFields[4])
return thisIdent
def getGI(blastLine):
""" parse blast output line and return GI of hit """
myFields = blastLine.strip().split()
# split subject (hit) ID field on |
myIDs = myFields[1].split("|")
thisGI= myIDs[1]
return thisGI
# Get input file name
myInfileName = sys.argv[1]
infile = open(myInfileName)
# Counter for hit GIs
myGICounter = Counter()
for thisLine in infile.readlines():
# Extract GI
thisGI = getGI(thisLine)
# Extract hit and query length
thisHitLength = getHitLength(thisLine)
thisQueryLength = getQueryLength(thisLine)
# Extract hit percent identity
thisIdent = getIdent(thisLine)
# Count this GI hit based on conditions
if (thisHitLength == thisQueryLength) and (thisIdent >= minHitIdentity):
myGICounter[thisGI] += 1
# Print top five hits
# print myGICounter.most_common(5)
for thisItem in myGICounter.most_common(topHits):
thisGI = thisItem[0]
thisCount = thisItem[1]
# Get taxon ID
taxonID = getTaxonID(thisGI)
# Get genus, species
genus,species = getTaxonGenusSpecies(taxonID)
print thisGI,thisCount,taxonID,genus,species
Reading
Review programs in lecture materials for this week and last
Exercises
Make sure you can write the programs shown in lecture
Turn In
Write a python program that parses BLAST table formatted output:
Use week9.fa BLAST results
Format BLAST results in table format and include percent identity
Count number of hits at 100% identity and same length as query for each query ID (not GI)
Write output that contains two columns: QUERY ID and number of hits:
# Number of hits at 100% identity, full length of query # QUERY ID Number Hits 1 23 2 21 3 20 4 14 5 12
Note
Learning Goals:
Full python program to count hits at 100% length and 100% identity per query:
#!/usr/bin/env python
"""
James Vincent
Mar 29 , 2012
parseBlastQueryHits.py
Version 1.0
Open a BLAST text file
loop over lines
split lines into fields
keep count of hits per query that are 100% length of query
and 100% identity
Output two column table with query ID, number of hits
"""
import sys
from collections import Counter
from Bio import Entrez
topHits = 10 # Number of top hits to output
minHitIdentity = 100.0 # Min percent idenity of hit to keep
Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are
def getTaxonID(giNumber):
""" return taxon id from entrez given input gi number """
myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]
return myTaxonID
def getTaxonGenusSpecies(taxonID):
""" return genus as string from Entrez given input taxon ID number """
handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
myGenus = handle[0]["Genus"]
mySpecies = handle[0]["Species"]
return myGenus,mySpecies
# -- BLAST parsing functions assume
# blast table format lines look like this
# qseqid sseqid qlen length pident evalue bitscore
#1 gi|343198516|ref|NR_042772.1| 100 100 100.00 1e-47 185
#1 gi|219878377|ref|NR_025516.1| 100 90 96.67 4e-37 150
def getQueryLength(blastLine):
""" return length of query from blast line """
myFields = blastLine.strip().split()
thisQueryLength = int(myFields[2])
return thisQueryLength
def getHitLength(blastLine):
""" return length of hit from blast line """
myFields = blastLine.strip().split()
thisHitLength = int(myFields[3])
return thisHitLength
def getIdent(blastLine):
""" parse blast output line and return percent identity of alignment """
myFields = blastLine.strip().split()
thisIdent = float(myFields[4])
return thisIdent
def getGI(blastLine):
""" parse blast output line and return GI of hit """
myFields = blastLine.strip().split()
# split subject (hit) ID field on |
myIDs = myFields[1].split("|")
thisGI= myIDs[1]
return thisGI
def getQueryID(blastLine):
""" parse blast output line and return query ID of hit """
myFields = blastLine.strip().split()
# split subject (hit) ID field on |
myQueryID = myFields[0]
return myQueryID
# Get input file name
myInfileName = sys.argv[1]
infile = open(myInfileName)
# Counter for hit GIs
#myGICounter = Counter()
myQueryCounter = Counter()
for thisLine in infile.readlines():
# Extract GI
# thisGI = getGI(thisLine)
# Extract query ID
thisQueryID = str(getQueryID(thisLine))
# Extract hit and query length
thisHitLength = getHitLength(thisLine)
thisQueryLength = getQueryLength(thisLine)
# Extract hit percent identity
thisIdent = getIdent(thisLine)
# Count this GI hit based on conditions
if (thisHitLength == thisQueryLength) and (thisIdent >= minHitIdentity):
myQueryCounter[thisQueryID] += 1
for thisKey in myQueryCounter.keys():
print thisKey,'\t',myQueryCounter[thisKey]
Submit multiple jobs to queue on lonestar:
#!/bin/bash
#$ -V # inherit shell environment
#$ -l h_rt=00:05:00 # wall time limit
#$ -q development # run in development queue - 1 hour max
#$ -pe 1way 12 # Number of nodes and cores
#$ -A TG-MCB110143 # TACC account
#$ -N Week11_Thurs # Name of job and stdout files
#$ -cwd # Run job from the directory it was submitted
#$ -j y
#------------------------
#
# James Vincent
# March 29, 2012
#
# Run blast on week9.fa vs 16SMicrobial database
# Contains 100 reads from 16S of s_mutans and 100 reads from m.komossens
# randomly distributed over 16S rRNA, 100bp length reads
# Reformat output to include Query seq-id, subject seq-id, score and e-value
#
#------------------------
# BLAST programs and variables
# TACC lonestar uses module system to provide blast
module load blast
module load python
# Database
DB=$WORK/JSCBIO2710/blast/databases/16SMicrobial
# Query
QUERY=week9.fa
OUTFILE=$QUERY.blast.asn
# BLAST output format: 11 is ASN, 6 is table no header
OUTFMT=11
# BLAST programs loaded by module command
BLASTN=blastn
BLAST_FORMATTER=blast_formatter
BLASTDBCMD=blastdbcmd
# ---!! Beware comments followed by $: #$ will be interpeted by qsub
# --- add extra space: # $
# Run blast
$BLASTN -db $DB -query $QUERY -outfmt $OUTFMT -out $OUTFILE
# Reformat ASN to hit custom hit table
$BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid qlen length pident evalue bitscore " -out $OUTFILE.table.pident
# Parse BLAST output with python program to get best hits
.parseBlast_v3.py $OUTFILE.table.pident
Rename jobs in queue script to keep track:
This line in qsub script:
#$ -N Week11_Thurs # Name of job and stdout files
is the name used to create the job output files:
Week11_Thurs.o522740
Week11_Thurs.po522740
Change the job name to clearly describe different jobs:
#$ -N Week11_Control1 # Name of job and stdout files
#$ -N Week11_Control2 # Name of job and stdout files
#$ -N Week11_Sample1 # Name of job and stdout files
Reading Exercises
Write the programs from this week (given above) on your own. Make sure you can write small functions and complete programs
Turn In:
Run BLAST jobs on the three FASTA files given below.
Copy the fasta files to your own working directory.
Modify your job script to keep track of which job you are running.
Convert the BLAST output to table format and inclue percent identity.
Parse the BLAST table output with your program (example given in Tues above).
to produce one line of output for each hit GI : GI, Count of hits, taxonID, genus, species
Turn in the resulting table for each of three FASTA files:
/work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control1.fa
/work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control2.fa
/work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control3.fa