Note
Learning Goals:
Use Counter object to count BLAST hits:
#!/usr/bin/env python
"""
James Vincent
March 20, 2012
Parse BLAST output in table format
Extract GI numbers and scores
Count number of times each hit GI occurs
"""
import sys
from collections import Counter
def getScore(oneLine):
""" Extract the score valeu from this line
Lines look like this: Score is last number.
1 gi|265679037|ref|NR_029345.1| 0.0 2846
1 gi|310975117|ref|NR_036981.1| 0.0 2819
"""
myFields = oneLine.split()
thisScore = float(myFields[3])
return thisScore
def getGI(oneLine):
""" Extract GI number from BLAST table format line
1 gi|265679037|ref|NR_029345.1| 0.0 2846
1 gi|310975117|ref|NR_036981.1| 0.0 2819
"""
myFields = oneLine.split()
myIDVals = myFields[1]
myGI = myIDVals.split('|')[1]
return myGI
# Get input BLAST formatted file
blastInfileName = sys.argv[1]
infile = open(blastInfileName)
# Counter for hit GIs
myGICounter = Counter()
# Loop over each line in the file
for thisLine in infile.readlines():
# Extract score
thisScore = getScore(thisLine)
# Extract G
thisGI = getGI(thisLine)
myGICounter[thisGI] += 1
# Print top five hits
print myGICounter.most_common(5)
Fetch files with wget:
You can download week9.fa from this link:
http://bit.ly/yoK27C
Copy directly to lonestar with wget
Log in to lonestar and move to a directory where you
would like to place the file, then call wget:
wget <URL> -O <outputFileName>
Example:
login1$ cd /work/00921/tg801771/JSCBIO2710/lectures/week9/Thurs/
login1$ wget http://bit.ly/yoK27C -O week9.fa
Retrieve BioPython with wget:
Open a browser: biopython.org
Find Download link, follow
Under Files look for latest source tarball version: biopython-1.59.tar.gz 8,377 Kb -- Source Tarball
Hold mouse over the file download active link and copy the link address (ctrl-clik or two finger click...)
Move to your $WORK directory on lonestar, create a directory called Software and move into that
Retrieve the biopython download with wget:
wget http://biopython.org/DIST/biopython-1.59.tar.gz
Without any other options the file will be downloaded with the same name
Install BioPython:
login1$ cd $WORK
login1$ mkdir Software
login1$ cd Software
login1$ wget http://biopython.org/DIST/biopython-1.59.tar.gz
--2012-03-13 04:48:40-- http://biopython.org/DIST/biopython-1.59.tar.gz
Resolving biopython.org... 208.94.50.58
Connecting to biopython.org|208.94.50.58|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8577492 (8.2M) [application/x-gzip]
Saving to: `biopython-1.59.tar.gz'
100%[================================================================>] 8,577,492 1.42M/s in 6.1s
2012-03-13 04:48:46 (1.34 MB/s) - `biopython-1.59.tar.gz' saved [8577492/8577492]
login1$ tar zxf biopython-1.58.tar.gz
login1$ cd biopython-1.58
login1$ module load python
login1$ python ./setup.py install --user
running install
running build
running build_py
running build_ext
running install_lib
running install_egg_info
Removing /home1/00921/tg801771/.local/lib/python2.7/site-packages/biopython-1.58-py2.7.egg-info
Writing /home1/00921/tg801771/.local/lib/python2.7/site-packages/biopython-1.58-py2.7.egg-info
login1$
login1$ python
Python 2.7.1 (r271:86832, Sep 20 2011, 09:42:22)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import Entrez
>>>
Sample eUtils program from Biopython:
#!/usr/bin/env python
"""
James Vincent
Mar 20 , 2012
efetchSample.py
Demonstrate use of BioPython efetch
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""
import sys
from Bio import Entrez
Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
print handle.read()
Reading
Biopython Chapter 8 Accessing NCBI’s Entrez databases: - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
Exercises
Turn In
Note
Learning Goals:
Build program to fetch taxonomy step by step:
* Write simple program that calls elink with one GI
* Write program that calls esummary with single taxon ID
* Convert simple programs into single function calls
* Add functions calls to existing BLAST parsing program
Use elink to fetch one taxon ID:
#!/usr/bin/env python
"""
James Vincent
Mar 22 , 2012
elink1.py
Demonstrate use of BioPython elink
to retrieve a taxonomic ID number from a GI number
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""
import sys
import pprint
from Bio import Entrez
Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are
# Fetch one taxon ID from entrez given input gi number
handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516")
print
# default is to return XML in an open file handle
for line in handle.readlines():
print line,
print
Results:
login2$ ./elink1.py
<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 23 November 2010//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_101123.dtd">
<eLinkResult>
<LinkSet>
<DbFrom>nuccore</DbFrom>
<IdList>
<Id>343198516</Id>
</IdList>
<LinkSetDb>
<DbTo>taxonomy</DbTo>
<LinkName>nuccore_taxonomy</LinkName>
<Link>
<Id>1309</Id>
</Link>
</LinkSetDb>
</LinkSet>
</eLinkResult>
Convert XML from Entrez into easy to use python data structure:
#!/usr/bin/env python
"""
James Vincent
Mar 22 , 2012
elink.py
Demonstrate use of BioPython elink
to retrieve a taxonomic ID number from a GI number
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""
import sys
import pprint
from Bio import Entrez
Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are
# Fetch one taxon ID from entrez given input gi number
# default is to return XML in an open file handle
handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516")
# Entrez.read call parses XML into python data structure
record = Entrez.read(handle)
print
print record
print
print
for thisKey in record[0].keys():
print 'Key: ',thisKey, '\tValue: ',record[0][thisKey]
print '\n\n'
Results:
login2$ ./elink2.py
[{u'DbFrom': 'nuccore', u'IdList': ['343198516'], u'LinkSetDbHistory': [], u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '1309'}], u'LinkName': 'nuccore_taxonomy'}]}]
Key: DbFrom Value: nuccore
Key: IdList Value: ['343198516']
Key: LinkSetDbHistory Value: []
Key: LinkSetDb Value: [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '1309'}], u'LinkName': 'nuccore_taxonomy'}]
Extract taxon ID from python data structure:
#!/usr/bin/env python
"""
James Vincent
Mar 22 , 2012
elink3.py
Demonstrate use of BioPython elink
to retrieve a taxonomic ID number from a GI number
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""
import sys
import pprint
from Bio import Entrez
Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are
# Fetch one taxon ID from entrez given input gi number
# default is to return XML in an open file handle
handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516")
# Entrez.read call parses XML into python data structure
record = Entrez.read(handle)
myTaxonID = record[0]["LinkSetDb"][0]["Link"][0]["Id"]
print 'taxon ID: ',myTaxonID
Results:
login2$ ./elink3.py
taxon ID: 1309
login2$
Use esummary to fetch taxonomic names:
#!/usr/bin/env python
"""
James Vincent
Mar 22 , 2012
esummary1.py
Demonstrate use of BioPython Entrez esummary
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""
import sys
from Bio import Entrez
Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are
# Get genus and species as string from Entrez given input taxon ID number """
handle = Entrez.read(Entrez.esummary(db="taxonomy", id="1309"))
print
print handle
print
Results:
login2$ ./esummary1.py
[{'StructNumber': 101, 'Division': 'firmicutes', 'ProtNumber': 13860, 'CommonName': '', 'TaxId': 1309, 'Rank': 'species', 'Species': 'mutans', u'Item': [], 'GeneNumber': 4030, 'ScientificName': 'Streptococcus mutans', 'GenNumber': 4, 'Subsp': '', 'NucNumber': 13851, 'Genus': 'Streptococcus', u'Id': '1309'}]
Extract genus and species as strings from data structure:
#!/usr/bin/env python
"""
James Vincent
Mar 22 , 2012
esummary1.py
Demonstrate use of BioPython Entrez esummary
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""
import sys
from Bio import Entrez
Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are
# Get genus and species as string from Entrez given input taxon ID number """
myTaxonID = 1309
handle = Entrez.read(Entrez.esummary(db="taxonomy", id=myTaxonID))
myGenus = handle[0]["Genus"]
mySpecies = handle[0]["Species"]
print 'Taxon ID: ',myTaxonID,' Genus: ',myGenus,' Species: ',mySpecies
Results:
login2$ ./esummary2.py
Taxon ID: 1309 Genus: Streptococcus Species: mutans
login2$
Reading
Biopython Chapter 8 Accessing NCBI’s Entrez databases: - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
Exercises
Turn In
write a function to parse each line and return the GI
use a Counter object to count how many times each GI number was seen
Write a function (based on the programs from Thurs lecture above)
Given a GI number as input returns a taxon ID
Write a second function (based on the programs from Thurs lecture above)
Given a taxon ID number as input returns genus and species as text strings
Extract the top five GI numbers and their counts and do the following:
Extract just the GI number from your counter
Fetch the taxon ID from the GI number
Fetch genus and species from taxon ID
print GI number, count of hits, genus, species