JSC-BIO-2710

Data Intensive Computing for Applied Bioinformatics

Table Of Contents

Previous topic

Week Eleven 27-Mar-12

Next topic

Week Thirteen 17-Apr-12

This Page

Week Twelve 10-Apr-12

Tuesday

Note

Learning Goals:

  • Write and use Bash for loops
  • Write and use simple Bash functions
  • Create shell script to manage multiple qsub jobs



 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 

Video



Lecture

Bash for loops:

#!/bin/bash


#  James Vincent
#  Apr 10 , 2012
#
#  Loop over a sequence of numbers
#

# for variable in (sequence of items)
for thisNum in 1 2 3 4
do
  echo $thisNum
done

Results:

login1$ ./forLoop1.sh
1
2
3
4

** Loop over sequence of words**:

#!/bin/bash


#  James Vincent
#  Apr 10 , 2012
#
#  Loop over a sequence of words
#

# for variable in (sequence of items)
for thisWord in one two skip a few
do
  echo $thisWord
done

Results:

login1$ ./forLoop2.sh
one
two
skip
a
few

Simple python program to count lines:

#!/usr/bin/env python

"""

James Vincent
Apr 10, 2012

parsePercents.py

Open a text file
loop over lines containing single number
keep count of numbers greater than input minimum value

Input: prog.py <infile> <minPercent>
Output: minPercent and count
"""

import sys

# Get input file name
myInfileName = sys.argv[1]
infile = open(myInfileName)

# Get minimum percent
minHitIdentity = sys.argv[2]

# Count of numbers meeting min percent
myCount=0

for thisLine in infile.readlines():

   # lines contain a single floating point number
   thisNum = thisLine.strip().split()[0]

   # Count this number if meets minimum percent
   if (thisNum >= minHitIdentity):
      myCount += 1

print 'Minimum percent: ',minHitIdentity,'  Count: ',myCount

Create control file of numbers:

login1$ cat percent10.txt
100.0
100.0
95.5
90.5
90.5
85.5
85.5
79.0
79.0
79.0

Call python program within bash loop:

#!/bin/bash

#  James Vincent
#  Apr 10 , 2012
#
#  Loop over a sequence of numbers
#  Call python program  to parse file of numbers
#

INFILE=percent10.txt

# for variable in (sequence of items)
for minPercent in 70 80 90
do
  ./parsePercents.py $INFILE $minPercent
done

Results:

login1$ ./forLoop3.sh
Minimum percent:  70   Count:  8
Minimum percent:  80   Count:  5
Minimum percent:  90   Count:  3

Homework

Reading

Exercises

Make sure you can write the shell and python programs above (Tues) on your own

Write the programs from this week (given above) on your own. Make sure you can write small functions and complete programs

Turn In:

Modify your python program that parses BLAST output files

Have the python program accept a second parameter on the command line:
./parseBlast.py  <blastoutput.table.pident>  minPercent

The program should accept a blast file (as it has before) and now a
number for minimum percent identity of a hit

Modify your program to use this input minimum percent as the cutoff
for counting hits

Run your modified program on the control1.fa BLAST output
with minimum percent identity cutoffs of 100%,90% and 80%



Thursday

Note

Learning Goals:

  • Go over complete python program to parse BLAST hits
  • Use bash loop to iterate over parsing program parameters
  • Review directory structure for projects

Video


Lecture

Call python program within bash loop:

#!/bin/bash

#  James Vincent
#  Apr 12 , 2012
#
#  Loop over a sequence of numbers
#  Call python program  to parse file of numbers
#

INFILE=control1.fa.blast.asn.table.pident

# for variable in (sequence of items)
for minPercent in 70 80 90
do
  echo "----- Min % identity: " $minPercent " ----- "
  ./parseBlastPercent.py $INFILE $minPercent
done

Results for control1.fa BLAST output:

login1$ ./loopParse.sh
----- Min % identity:  70  -----
219856887 10 86187
343202424 10 518738 Shewanella vesiculosa
343200840 10 133448 Citrobacter youngae
310975079 10 74109 Photobacterium profundum
343201686 10 333962 Providencia heimbachae
----- Min % identity:  80  -----
219856887 10 86187
343202424 10 518738 Shewanella vesiculosa
343200840 10 133448 Citrobacter youngae
310975079 10 74109 Photobacterium profundum
343201686 10 333962 Providencia heimbachae
----- Min % identity:  90  -----
219856887 10 86187
343202424 10 518738 Shewanella vesiculosa
343200840 10 133448 Citrobacter youngae
310975079 10 74109 Photobacterium profundum
343201122 10 29486 Yersinia ruckeri

Complete Python program to parse BLAST hits with min identity from command line:

#!/usr/bin/env python

"""

James Vincent
Apr 12, 2012

simpleBlast.py

Open a BLAST text file
loop over lines
keep count of hits per GI
split lines into fields
if minimum hit percent identity, count hit
for all counted hits:
Extract GI number from each line
Convert GI to taxon ID number
convert taxon ID to genus and species text strings

Output: top five hits by count of GI with genus and species for each
"""

import sys
from collections import Counter
from Bio import Entrez



Entrez.email = "jim@uvm.edu"  # Always tell NCBI who you are

def getTaxonID(giNumber):
   """ return taxon id from entrez given input gi number """

   myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
   myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]

   return myTaxonID


def getTaxonGenusSpecies(taxonID):
   """ return genus as string from Entrez given input taxon ID number """

   handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
   myGenus = handle[0]["Genus"]
   mySpecies = handle[0]["Species"]

   return myGenus,mySpecies

#  -- BLAST parsing functions assume
#     blast table format lines look like this
# qseqid sseqid qlen length pident evalue bitscore
#1   gi|343198516|ref|NR_042772.1|   100     100     100.00  1e-47    185
#1   gi|219878377|ref|NR_025516.1|   100     90      96.67   4e-37    150

def getQueryLength(blastLine):
   """ return length of query from blast line """

   myFields = blastLine.strip().split()

   thisQueryLength  = int(myFields[2])

   return thisQueryLength


def getHitLength(blastLine):
   """ return length of hit from blast line """

   myFields = blastLine.strip().split()

   thisHitLength  = int(myFields[3])

   return thisHitLength


def getIdent(blastLine):
   """ parse blast output line and return percent identity of alignment """

   myFields = blastLine.strip().split()

   thisIdent  = float(myFields[4])

   return thisIdent


def getGI(blastLine):
   """ parse blast output line and return GI of hit """

   myFields = blastLine.strip().split()

   # split subject (hit) ID field on |
   myIDs = myFields[1].split("|")
   thisGI= myIDs[1]

   return thisGI


# Get input file name
myInfileName = sys.argv[1]
infile = open(myInfileName)

topHits = 5                  # Number of top hits to output
minHitIdentity = float(sys.argv[2])         # Min percent idenity of hit to keep


# Counter for hit GIs
myGICounter = Counter()

for thisLine in infile.readlines():

   # Extract GI
   thisGI  = getGI(thisLine)

   # Extract hit  and query length
   thisHitLength  = getHitLength(thisLine)
   thisQueryLength  = getQueryLength(thisLine)

   # Extract hit percent identity
   thisIdent  = getIdent(thisLine)

   # Count this GI hit based on conditions
   #if (thisHitLength == thisQueryLength)  and (thisIdent >= minHitIdentity):

   if (thisIdent >= minHitIdentity):
      myGICounter[thisGI] += 1



# Print top five hits
# print myGICounter.most_common(5)
for thisItem in myGICounter.most_common(topHits):
   thisGI = thisItem[0]
   thisCount = thisItem[1]

   # Get taxon ID
   taxonID = getTaxonID(thisGI)
   # Get genus, species
   genus,species =  getTaxonGenusSpecies(taxonID)
   print thisGI,thisCount,taxonID,genus,species

Homework

Reading

Exercises

Turn In:

Run BLAST on the three control files below.
Use your BLAST parsing program to parse the hits at several
minimum lengths and percent identity matches.

Modify the BLAST e-value cutoff parameter when running BLAST until you
receive hits of some kind.

Summarize what you have found.

/work/00921/tg801771/JSCBIO2710/lectures/week12/Thurs/negativeControl1.fa
/work/00921/tg801771/JSCBIO2710/lectures/week12/Thurs/negativeControl2.fa
/work/00921/tg801771/JSCBIO2710/lectures/week12/Thurs/negativeControl3.fa