JSC-BIO-2710

Data Intensive Computing for Applied Bioinformatics

Table Of Contents

Previous topic

Week Six 21-Feb-12

This Page

Week Seven 28-Feb-12

Tuesday

Note

Learning Goals:

  • Log in to lonestar.tacc.teragrid.org
  • recreate working directories
  • submit a job through qsub

Lecture

Texas Advanced Computing Center: TACC

lonestar.tacc.teragrid.org

Log in to the TACC lonestar cluster lonestar.tacc.teragrid.org:

You should have received login details from XSEDE for
your new account.

jjv5$  ssh tg801771@lonestar.tacc.teragrid.org

Make sure we are using the bash shell:

login1$ echo $SHELL
/bin/bash

# If needed we can change the defualt shell to bash:
login1$ chsh -l
/bin/sh
/bin/bash
/sbin/nologin
/bin/tcsh
/bin/csh
/bin/ksh
/bin/zsh

Recreate directory structure

Make new directories for quiz,homework,projects:

login2$  mkdir quiz homework projects
login2$ ls
homework  projects  quiz

Create any other directories as needed

Transfer files from AWS EC2 server to lonestar

Open a second terminal window:

# Log in to the EC2 server
$ ssh ec2-23-20-18-242.compute-1.amazonaws.com
jjv5@ec2-23-20-18-242.compute-1.amazonaws.com's password:

$ cd lectures/
$ ls
  week5
$ cd week5/
$ ls
  Thurs  Tues
$ cd Thurs/
$ ls

# use sftp to connect to lonestar
$ sftp tg801771@lonestar.tacc.teragrid.org
Connecting to lonestar.tacc.teragrid.org...
The authenticity of host 'lonestar.tacc.teragrid.org (129.114.53.21)' can't be established.
RSA key fingerprint is 5c:36:42:99:aa:2d:52:58:70:3a:20:c2:3a:33:e4:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'lonestar.tacc.teragrid.org,129.114.53.21' (RSA) to the list of known hosts.
Password:


# transfer files as needed

sftp> cd lectures
sftp> cd week5
sftp> cd Thurs
sftp> lls
4            example2.py  example4.py  myNumbers.txt  runBlast.sh  week5.fa.blast.asn
example1.py  example3.py  example5.py  parseBlast.py  week5.fa
sftp> put runBlast.sh
Uploading runBlast.sh to /home1/00921/tg801771/lectures/week5/Thurs/runBlast.sh
runBlast.sh                                                 100%  815     0.8KB/s   00:00
sftp>

Create a job script

Create the script runHello.sh shown below:

#!/bin/bash

#$ -pe 1way 12 #  12 cores per node - must take them all
#$ -q development # Queue name
#$ -N  helloWorld
#$ -A TG-MCB120034
#$ -V                      # inherit submission env
#$ -j y                   # combine stderr & stdout into stdout
#$ -o $JOB_NAME.o$JOB_ID  # Name of the output file (eg. myMPI.oJobID)
#$ -l h_rt=00:05:00       # Run time (hh:mm:ss)

#$ -M jjv5.jjv5@gmail.com
#$ -m bea

echo "Hello, I am running"
date
hostname

Submit the job to the development queue

The queue is specifiec in the job script itself:

qsub runHello.sh

Monitor the job with th qstat command:

login2$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 479531 0.00000 helloWorld tg801771     qw    02/28/2012 05:10:32                                   12

Homework

Reading

Exercises

Turn In

  • transfer the bash shell scripts, python programs and 16S blast database from the AWS EC2 server to lonestar

create a shell script to run on lonestar using qsub that does the following:

cd to your home directory
list all files
print the date
  • be sure to use the proper qsub options and resource specifications in your script
  • use the development queue to make sure the job runs properly
  • when you are sure it runs correctly, change the queue name to ‘normal’
  • use qstat to monitor how long it takes before your job runs in the ‘normal’ queue

Thursday

Note

Learning Goals:

  • Run BLAST from shell script
  • reformat and parse BLAST output
  • look up taxon ID from blast hit GI

Video

Lecture

Run BLAST vs 16S database:

#!/bin/bash

# James Vincent
# March 1, 2012
#
#
# Run blast on week5.fa vs 16SMicrobial database
# Reformat output to include Query seq-id, subject seq-id, score and e-value
#


# Database
DB=/home/jjv5/blast/databases/16SMicrobial

# Query
QUERY=week5.fa
OUTFILE=$QUERY.blast.asn


# BLAST output format: 11 is ASN, 6 is table no header
OUTFMT=11

# BLAST programs
BLASTN=/mnt/blast/ncbi-blast-2.2.25+/bin/blastn
BLAST_FORMATTER=/mnt/blast/ncbi-blast-2.2.25+/bin/blast_formatter
BLASTDBCMD=/mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd

# Run blast
#$BLASTN -db $DB  -query $QUERY -outfmt $OUTFMT -out $OUTFILE

# Reformat ASN to hit custom hit table
#$BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid evalue bitscore" -out $OUTFILE.table
#$BLAST_FORMATTER -archive $OUTFILE -outfmt "6 sgi" -out $OUTFILE.table.sgi

# default format values: 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'

#
for i in `cat week5.fa.blast.asn.table.sgi`
do
$BLASTDBCMD -db $DB -entry $i -outfmt "%g %T "
done

Parse BLAST output with python program:

#!/usr/bin/env python

"""

James Vincent
Mar 1 , 2012

parseBlast.py

Open a text file
loop over lines
split lines into fields
Sum numbers from certain field

"""

import sys

# Get file name
myInfileName = sys.argv[1]

infile = open(myInfileName)

mySum = 0.0
myCount = 0
# loop over each line in the file
for thisLine in infile.readlines():

  # BLAST input file has hit lines like this:
  # 1        gi|219856848|ref|NR_024667.1|   0.0     2551
  myFields = thisLine.strip().split()

  thisScore  = int(myFields[3])

  # Accumulate scores greater than 3
  if thisScore > 2600:
     # accumulate scores
     mySum = mySum + thisScore
     # count number of scores matching
     myCount = myCount + 1


# Print sum, count and average
print "Sum is: ",mySum
print "Count is: ",myCount
print "Average is: ",mySum/myCount

Use blastdbcmd to look up taxonid:

#!/bin/bash

# James Vincent
# March 1, 2012
#
#
# Get taxonimic ID of BLAST hit from GI number
#


# Database
DB=/home/jjv5/blast/databases/16SMicrobial

# Query
QUERY=week5.fa
OUTFILE=$QUERY.blast.asn.table

# BLAST programs
BLASTDBCMD=/mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd

# Extract taxon ID

$BLASTDBCMD -db $DB -entry 310975189 -outfmt "%g %T "

Look up taxon ids from all blast hits:

#!/bin/bash

# James Vincent
# March 1, 2012
#
# Loop over all lines in BLAST output with hit GI
# Get taxonimic ID of BLAST hit from GI number
#


# Database
DB=/home/jjv5/blast/databases/16SMicrobial

# Query
QUERY=week5.fa
OUTFILE=$QUERY.blast.asn.table

# BLAST programs
BLASTDBCMD=/mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd

# Extract taxon ID

for i in `cat $OUTFILE.sgi`
do
    $BLASTDBCMD -db $DB -entry $i -outfmt "%g %T "
done

Look up taxon id from python program:

Homework

Reading

Exercises

Turn In

  • Complete the exercises in http://bash.cyberciti.biz/guide/For_loop

  • Put each example in a separate shell script and leav them in your homework directory

  • Write a python program that parses BLAST output in table format:

    Use the input file week5.fa from class to run BLAST against 16S database Reformat the BLAST output into table format that includes the bitscore In your python program select only BLAST output hits that have a bitscore greater than 2000