Note
Learning Goals:
Bash for loops:
#!/bin/bash
# James Vincent
# Apr 10 , 2012
#
# Loop over a sequence of numbers
#
# for variable in (sequence of items)
for thisNum in 1 2 3 4
do
echo $thisNum
done
Results:
login1$ ./forLoop1.sh
1
2
3
4
** Loop over sequence of words**:
#!/bin/bash
# James Vincent
# Apr 10 , 2012
#
# Loop over a sequence of words
#
# for variable in (sequence of items)
for thisWord in one two skip a few
do
echo $thisWord
done
Results:
login1$ ./forLoop2.sh
one
two
skip
a
few
Simple python program to count lines:
#!/usr/bin/env python
"""
James Vincent
Apr 10, 2012
parsePercents.py
Open a text file
loop over lines containing single number
keep count of numbers greater than input minimum value
Input: prog.py <infile> <minPercent>
Output: minPercent and count
"""
import sys
# Get input file name
myInfileName = sys.argv[1]
infile = open(myInfileName)
# Get minimum percent
minHitIdentity = sys.argv[2]
# Count of numbers meeting min percent
myCount=0
for thisLine in infile.readlines():
# lines contain a single floating point number
thisNum = thisLine.strip().split()[0]
# Count this number if meets minimum percent
if (thisNum >= minHitIdentity):
myCount += 1
print 'Minimum percent: ',minHitIdentity,' Count: ',myCount
Create control file of numbers:
login1$ cat percent10.txt
100.0
100.0
95.5
90.5
90.5
85.5
85.5
79.0
79.0
79.0
Call python program within bash loop:
#!/bin/bash
# James Vincent
# Apr 10 , 2012
#
# Loop over a sequence of numbers
# Call python program to parse file of numbers
#
INFILE=percent10.txt
# for variable in (sequence of items)
for minPercent in 70 80 90
do
./parsePercents.py $INFILE $minPercent
done
Results:
login1$ ./forLoop3.sh
Minimum percent: 70 Count: 8
Minimum percent: 80 Count: 5
Minimum percent: 90 Count: 3
Reading
Exercises
Make sure you can write the shell and python programs above (Tues) on your own
Write the programs from this week (given above) on your own. Make sure you can write small functions and complete programs
Turn In:
Modify your python program that parses BLAST output files
Have the python program accept a second parameter on the command line:
./parseBlast.py <blastoutput.table.pident> minPercent
The program should accept a blast file (as it has before) and now a
number for minimum percent identity of a hit
Modify your program to use this input minimum percent as the cutoff
for counting hits
Run your modified program on the control1.fa BLAST output
with minimum percent identity cutoffs of 100%,90% and 80%
Note
Learning Goals:
Call python program within bash loop:
#!/bin/bash
# James Vincent
# Apr 12 , 2012
#
# Loop over a sequence of numbers
# Call python program to parse file of numbers
#
INFILE=control1.fa.blast.asn.table.pident
# for variable in (sequence of items)
for minPercent in 70 80 90
do
echo "----- Min % identity: " $minPercent " ----- "
./parseBlastPercent.py $INFILE $minPercent
done
Results for control1.fa BLAST output:
login1$ ./loopParse.sh
----- Min % identity: 70 -----
219856887 10 86187
343202424 10 518738 Shewanella vesiculosa
343200840 10 133448 Citrobacter youngae
310975079 10 74109 Photobacterium profundum
343201686 10 333962 Providencia heimbachae
----- Min % identity: 80 -----
219856887 10 86187
343202424 10 518738 Shewanella vesiculosa
343200840 10 133448 Citrobacter youngae
310975079 10 74109 Photobacterium profundum
343201686 10 333962 Providencia heimbachae
----- Min % identity: 90 -----
219856887 10 86187
343202424 10 518738 Shewanella vesiculosa
343200840 10 133448 Citrobacter youngae
310975079 10 74109 Photobacterium profundum
343201122 10 29486 Yersinia ruckeri
Complete Python program to parse BLAST hits with min identity from command line:
#!/usr/bin/env python
"""
James Vincent
Apr 12, 2012
simpleBlast.py
Open a BLAST text file
loop over lines
keep count of hits per GI
split lines into fields
if minimum hit percent identity, count hit
for all counted hits:
Extract GI number from each line
Convert GI to taxon ID number
convert taxon ID to genus and species text strings
Output: top five hits by count of GI with genus and species for each
"""
import sys
from collections import Counter
from Bio import Entrez
Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are
def getTaxonID(giNumber):
""" return taxon id from entrez given input gi number """
myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]
return myTaxonID
def getTaxonGenusSpecies(taxonID):
""" return genus as string from Entrez given input taxon ID number """
handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
myGenus = handle[0]["Genus"]
mySpecies = handle[0]["Species"]
return myGenus,mySpecies
# -- BLAST parsing functions assume
# blast table format lines look like this
# qseqid sseqid qlen length pident evalue bitscore
#1 gi|343198516|ref|NR_042772.1| 100 100 100.00 1e-47 185
#1 gi|219878377|ref|NR_025516.1| 100 90 96.67 4e-37 150
def getQueryLength(blastLine):
""" return length of query from blast line """
myFields = blastLine.strip().split()
thisQueryLength = int(myFields[2])
return thisQueryLength
def getHitLength(blastLine):
""" return length of hit from blast line """
myFields = blastLine.strip().split()
thisHitLength = int(myFields[3])
return thisHitLength
def getIdent(blastLine):
""" parse blast output line and return percent identity of alignment """
myFields = blastLine.strip().split()
thisIdent = float(myFields[4])
return thisIdent
def getGI(blastLine):
""" parse blast output line and return GI of hit """
myFields = blastLine.strip().split()
# split subject (hit) ID field on |
myIDs = myFields[1].split("|")
thisGI= myIDs[1]
return thisGI
# Get input file name
myInfileName = sys.argv[1]
infile = open(myInfileName)
topHits = 5 # Number of top hits to output
minHitIdentity = float(sys.argv[2]) # Min percent idenity of hit to keep
# Counter for hit GIs
myGICounter = Counter()
for thisLine in infile.readlines():
# Extract GI
thisGI = getGI(thisLine)
# Extract hit and query length
thisHitLength = getHitLength(thisLine)
thisQueryLength = getQueryLength(thisLine)
# Extract hit percent identity
thisIdent = getIdent(thisLine)
# Count this GI hit based on conditions
#if (thisHitLength == thisQueryLength) and (thisIdent >= minHitIdentity):
if (thisIdent >= minHitIdentity):
myGICounter[thisGI] += 1
# Print top five hits
# print myGICounter.most_common(5)
for thisItem in myGICounter.most_common(topHits):
thisGI = thisItem[0]
thisCount = thisItem[1]
# Get taxon ID
taxonID = getTaxonID(thisGI)
# Get genus, species
genus,species = getTaxonGenusSpecies(taxonID)
print thisGI,thisCount,taxonID,genus,species
Reading
Exercises
Turn In:
Run BLAST on the three control files below.
Use your BLAST parsing program to parse the hits at several
minimum lengths and percent identity matches.
Modify the BLAST e-value cutoff parameter when running BLAST until you
receive hits of some kind.
Summarize what you have found.
/work/00921/tg801771/JSCBIO2710/lectures/week12/Thurs/negativeControl1.fa
/work/00921/tg801771/JSCBIO2710/lectures/week12/Thurs/negativeControl2.fa
/work/00921/tg801771/JSCBIO2710/lectures/week12/Thurs/negativeControl3.fa