Note
Learning Goals:
User defined functions for parsing BLAST output:
#!/usr/bin/env python
"""
James Vincent
Mar 13 , 2012
parseBlast.py
Open a text file
loop over lines
split lines into fields
Sum numbers from certain field
"""
import sys
def getScore(blastLine):
""" parse blast output line and return score """
# BLAST input file has hit lines like this:
# fmt "6 qseqid sseqid evalue bitscore"
# 1 gi|219856848|ref|NR_024667.1| 0.0 2551
myFields = blastLine.strip().split()
thisScore = float(myFields[3])
return thisScore
def getGI(blastLine):
""" parse blast output line and return GI of hit """
# blast line looks like this:
# 1 gi|219856848|ref|NR_024667.1| 0.0 2551
# spit line into fields by white space
myFields = blastLine.strip().split()
# split subject (hit) ID field on |
myIDs = myFields[1].split("|")
thisGI= int(myIDs[1])
return thisGI
# Get file name
myInfileName = sys.argv[1]
infile = open(myInfileName)
mySum = 0.0
myCount = 0
# loop over each line in the file
for thisLine in infile.readlines():
thisScore = getScore(thisLine)
thisGI = getGI(thisLine)
print thisGI,thisScore
# Accumulate scores greater than 3
if thisScore > 2600:
# accumulate scores
mySum = mySum + thisScore
# count number of scores matching
myCount = myCount + 1
# Print sum, count and average
print "Sum is: ",mySum
print "Count is: ",myCount
if myCount > 0:
print "Average is: ",mySum/myCount
else:
print "No scores"
How to use python Counter:
#!/usr/bin/env python
"""
James Vincent
Mar 13 , 2012
myCounter.py
Demonstrate use of python Counter
http://docs.python.org/library/collections.html
"""
import sys
from collections import Counter
myCounter = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue','red', 'blue', 'red', 'green', 'blue', 'blue']:
myCounter[word] += 1
print myCounter
print "Top two most common: ", myCounter.most_common(2)
# Sum of all values:
print sum(myCounter.values())
Install BioPython:
login1$ cd $WORK
login1$ mkdir Software
login1$ cd Software
login1$ wget http://biopython.org/DIST/biopython-1.59.tar.gz
--2012-03-13 04:48:40-- http://biopython.org/DIST/biopython-1.59.tar.gz
Resolving biopython.org... 208.94.50.58
Connecting to biopython.org|208.94.50.58|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8577492 (8.2M) [application/x-gzip]
Saving to: `biopython-1.59.tar.gz'
100%[================================================================>] 8,577,492 1.42M/s in 6.1s
2012-03-13 04:48:46 (1.34 MB/s) - `biopython-1.59.tar.gz' saved [8577492/8577492]
login1$ tar zxf biopython-1.58.tar.gz
login1$ cd biopython-1.58
login1$ module load python
login1$ python ./setup.py install --user
running install
running build
running build_py
running build_ext
running install_lib
running install_egg_info
Removing /home1/00921/tg801771/.local/lib/python2.7/site-packages/biopython-1.58-py2.7.egg-info
Writing /home1/00921/tg801771/.local/lib/python2.7/site-packages/biopython-1.58-py2.7.egg-info
login1$
login1$ python
Python 2.7.1 (r271:86832, Sep 20 2011, 09:42:22)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import Entrez
>>>
Reading
Exercises
Turn In
write a function to parse each line and return the GI
use a Counter object to count how many times each GI number was seen
print the top five GI numbers and their counts at the end of the program
use week9.fa as the query file for BLAST
You can download week9.fa from this link
http://bit.ly/yoK27C
You can copy it directly to your lonestar account with the wget
program. Log in to lonestar and move to a directory where you
would like to place the file, then call wget:
wget <URL> -O <outputFileName>
Example:
login1$ cd /work/00921/tg801771/JSCBIO2710/lectures/week9/Thurs/
login1$ wget http://bit.ly/yoK27C -O week9.fa