JSC-BIO-2710

Data Intensive Computing for Applied Bioinformatics

Table Of Contents

Previous topic

Week Nine 13-Mar-12

This Page

Week Ten 20-Mar-12

Tuesday

Note

Learning Goals:

  • Use Counter to count top BLAST hits
  • Learn to use wget program
  • Install BioPython, make sure it works
  • Create sample BioPython program to call NCBI efetch utils




 
 
 
 
 
 
 
 
 

Video



Lecture

Use Counter object to count BLAST hits:

#!/usr/bin/env python

"""

James Vincent
March 20, 2012

Parse BLAST output in table format
Extract GI numbers and scores
Count number of times each hit GI occurs

"""

import sys
from collections import Counter


def getScore(oneLine):
   """  Extract the score valeu from this line
        Lines look like this: Score is last number.

        1        gi|265679037|ref|NR_029345.1|   0.0     2846
        1        gi|310975117|ref|NR_036981.1|   0.0     2819

    """

   myFields = oneLine.split()
   thisScore = float(myFields[3])

   return thisScore

def getGI(oneLine):
   """ Extract GI number from BLAST table format line

        1        gi|265679037|ref|NR_029345.1|   0.0     2846
        1        gi|310975117|ref|NR_036981.1|   0.0     2819
   """

   myFields = oneLine.split()
   myIDVals = myFields[1]
   myGI = myIDVals.split('|')[1]

   return myGI


# Get input BLAST formatted file
blastInfileName = sys.argv[1]
infile = open(blastInfileName)

# Counter for hit GIs
myGICounter = Counter()

# Loop over each line in the file
for thisLine in infile.readlines():
   # Extract score
   thisScore = getScore(thisLine)
   # Extract G
   thisGI  = getGI(thisLine)
   myGICounter[thisGI] += 1

# Print top five hits
print myGICounter.most_common(5)



Fetch files with wget:

You can download week9.fa from this link:

http://bit.ly/yoK27C

Copy directly to lonestar with  wget

Log in to lonestar and move to a directory where you
would like to place the file, then call wget:

wget <URL> -O <outputFileName>

Example:

login1$ cd /work/00921/tg801771/JSCBIO2710/lectures/week9/Thurs/
login1$  wget http://bit.ly/yoK27C -O week9.fa



Retrieve BioPython with wget:

Open a browser:  biopython.org

Find Download link, follow

Under Files look for latest source tarball version:  biopython-1.59.tar.gz 8,377 Kb -- Source Tarball

Hold mouse over the file download active link and copy the link address (ctrl-clik or two finger click...)

Move to your $WORK directory on lonestar, create a directory called Software and move into that

Retrieve the biopython download with wget:

wget http://biopython.org/DIST/biopython-1.59.tar.gz

Without any other options the file will be downloaded with the same name



Install BioPython:

login1$ cd $WORK
login1$ mkdir Software
login1$ cd Software
login1$ wget http://biopython.org/DIST/biopython-1.59.tar.gz

  --2012-03-13 04:48:40--  http://biopython.org/DIST/biopython-1.59.tar.gz
  Resolving biopython.org... 208.94.50.58
  Connecting to biopython.org|208.94.50.58|:80... connected.
  HTTP request sent, awaiting response... 200 OK
  Length: 8577492 (8.2M) [application/x-gzip]
  Saving to: `biopython-1.59.tar.gz'

  100%[================================================================>] 8,577,492   1.42M/s   in 6.1s

  2012-03-13 04:48:46 (1.34 MB/s) - `biopython-1.59.tar.gz' saved [8577492/8577492]

login1$ tar zxf biopython-1.58.tar.gz
login1$ cd biopython-1.58
login1$ module load python
login1$ python ./setup.py install --user

  running install
  running build
  running build_py
  running build_ext
  running install_lib
  running install_egg_info
  Removing /home1/00921/tg801771/.local/lib/python2.7/site-packages/biopython-1.58-py2.7.egg-info
  Writing /home1/00921/tg801771/.local/lib/python2.7/site-packages/biopython-1.58-py2.7.egg-info

login1$
login1$ python
Python 2.7.1 (r271:86832, Sep 20 2011, 09:42:22)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import Entrez
>>>

Sample eUtils program from Biopython:

#!/usr/bin/env python

"""

James Vincent
Mar 20 , 2012

efetchSample.py

Demonstrate use of BioPython efetch

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96

"""

import sys
from Bio import Entrez

Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
print handle.read()

Homework

Reading

Biopython Chapter 8 Accessing NCBI’s Entrez databases: - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96

Exercises

Turn In

Thursday




Video



Lecture

Note

Learning Goals:

  • Create simple BioPython program to call NCBI elink utils
  • Create simple BioPython program to call NCBI esummary utils
  • Use elink,esummary together to retrieve taxonomic names
  • Make taxonomic profile of top ten hits from BLAST week9 output

Build program to fetch taxonomy step by step:

* Write simple program that calls elink with one GI
* Write program that calls esummary with single taxon ID
* Convert simple programs into single function calls
* Add functions calls to existing BLAST parsing program

Use elink to fetch one taxon ID:

#!/usr/bin/env python

"""
James Vincent
Mar 22 , 2012

elink1.py

Demonstrate use of BioPython elink
to retrieve a taxonomic ID number from a GI number

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""

import sys
import pprint
from Bio import Entrez

Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are

# Fetch one taxon ID from entrez given input gi number
handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516")

print

# default is to return XML in an open file handle
for line in handle.readlines():
  print line,

print

Results:

login2$ ./elink1.py

<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 23 November 2010//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_101123.dtd">
<eLinkResult>

        <LinkSet>
                <DbFrom>nuccore</DbFrom>
                <IdList>
                        <Id>343198516</Id>
                </IdList>
                <LinkSetDb>
                        <DbTo>taxonomy</DbTo>
                        <LinkName>nuccore_taxonomy</LinkName>
                        <Link>
                                <Id>1309</Id>
                        </Link>
                </LinkSetDb>
        </LinkSet>
</eLinkResult>

Convert XML from Entrez into easy to use python data structure:

#!/usr/bin/env python

"""

James Vincent
Mar 22 , 2012

elink.py

Demonstrate use of BioPython elink
to retrieve a taxonomic ID number from a GI number

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""

import sys
import pprint
from Bio import Entrez

Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are

# Fetch one taxon ID from entrez given input gi number
# default is to return XML in an open file handle
handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516")

# Entrez.read call parses XML into python data structure
record = Entrez.read(handle)
print
print record
print
print
for thisKey in record[0].keys():
  print 'Key: ',thisKey, '\tValue: ',record[0][thisKey]

print '\n\n'

Results:

login2$ ./elink2.py


[{u'DbFrom': 'nuccore', u'IdList': ['343198516'], u'LinkSetDbHistory': [], u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '1309'}], u'LinkName': 'nuccore_taxonomy'}]}]


Key:  DbFrom        Value:  nuccore
Key:  IdList        Value:  ['343198516']
Key:  LinkSetDbHistory      Value:  []
Key:  LinkSetDb     Value:  [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '1309'}], u'LinkName': 'nuccore_taxonomy'}]

Extract taxon ID from python data structure:

#!/usr/bin/env python

"""

James Vincent
Mar 22 , 2012

elink3.py

Demonstrate use of BioPython elink
to retrieve a taxonomic ID number from a GI number

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""

import sys
import pprint
from Bio import Entrez

Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are

# Fetch one taxon ID from entrez given input gi number
# default is to return XML in an open file handle
handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516")

# Entrez.read call parses XML into python data structure
record = Entrez.read(handle)

myTaxonID = record[0]["LinkSetDb"][0]["Link"][0]["Id"]
print 'taxon ID: ',myTaxonID

Results:

login2$ ./elink3.py
taxon ID:  1309
login2$

Use esummary to fetch taxonomic names:

#!/usr/bin/env python

"""

James Vincent
Mar 22 , 2012

esummary1.py

Demonstrate use of BioPython Entrez esummary
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""

import sys
from Bio import Entrez

Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are

# Get genus and species  as string from Entrez given input taxon ID number """

handle = Entrez.read(Entrez.esummary(db="taxonomy", id="1309"))
print
print handle
print

Results:

login2$ ./esummary1.py

[{'StructNumber': 101, 'Division': 'firmicutes', 'ProtNumber': 13860, 'CommonName': '', 'TaxId': 1309, 'Rank': 'species', 'Species': 'mutans', u'Item': [], 'GeneNumber': 4030, 'ScientificName': 'Streptococcus mutans', 'GenNumber': 4, 'Subsp': '', 'NucNumber': 13851, 'Genus': 'Streptococcus', u'Id': '1309'}]

Extract genus and species as strings from data structure:

#!/usr/bin/env python

"""

James Vincent
Mar 22 , 2012

esummary1.py

Demonstrate use of BioPython Entrez esummary
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
"""

import sys
from Bio import Entrez

Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are

# Get genus and species  as string from Entrez given input taxon ID number """
myTaxonID = 1309
handle = Entrez.read(Entrez.esummary(db="taxonomy", id=myTaxonID))
myGenus = handle[0]["Genus"]
mySpecies = handle[0]["Species"]

print 'Taxon ID: ',myTaxonID,' Genus: ',myGenus,' Species: ',mySpecies

Results:

login2$ ./esummary2.py
Taxon ID:  1309  Genus:  Streptococcus  Species:  mutans
login2$

Homework

Reading

Biopython Chapter 8 Accessing NCBI’s Entrez databases: - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96

Exercises

  • Write the programs given above for Thursday’s lecture

Turn In

Write a single python program based on BLAST parser for table formatted output:
  • write a function to parse each line and return the GI

  • use a Counter object to count how many times each GI number was seen

  • Write a function (based on the programs from Thurs lecture above)

    Given a GI number as input returns a taxon ID
  • Write a second function (based on the programs from Thurs lecture above)

    Given a taxon ID number as input returns genus and species as text strings
  • Extract the top five GI numbers and their counts and do the following:

    Extract just the GI number from your counter
    Fetch the taxon ID from the GI number
    Fetch genus and species from taxon ID
    print GI number, count of hits, genus, species