JSC-BIO-2710

Data Intensive Computing for Applied Bioinformatics

Week Eight 6-Mar-12

Tuesday

Note

Learning Goals:

  • Log in to lonestar.tacc.teragrid.org
  • Recreate working directories
  • Submit a job through qsub





 
 
 
 
 
 
 
 
 


Lecture

Texas Advanced Computing Center: TACC

lonestar.tacc.teragrid.org

You should have received login details from XSEDE for your new account. Use them to log in to the TACC lonestar cluster at lonestar.tacc.teragrid.org:

jjv5$  ssh tg801771@lonestar.tacc.teragrid.org

Make sure we are using the bash shell:

login1$ echo $SHELL
/bin/bash

# If needed, we can change the default shell to bash.
# List the available shells (chsh -s /bin/bash would then select bash):
login1$ chsh -l
/bin/sh
/bin/bash
/sbin/nologin
/bin/tcsh
/bin/csh
/bin/ksh
/bin/zsh




Recreate directory structure

Important

All files should be placed in the $WORK directory.

Create directories in $WORK:

login2$  cd $WORK
login2$  mkdir quiz homework projects
login2$ ls
homework  projects  quiz

Create any other directories as needed
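
The same layout can also be recreated from a script, which is convenient when many directories are needed. The following is a minimal sketch, written for Python 3. The function name make_course_dirs is hypothetical, and the default directory names are the three created above.

```python
import os


def make_course_dirs(base, names=("quiz", "homework", "projects")):
    """Create each named directory under base and return the paths."""
    created = []
    for name in names:
        path = os.path.join(base, name)
        # exist_ok=True means an already existing directory is not an error
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# Example call on lonestar, assuming $WORK is set in the environment:
#   make_course_dirs(os.environ["WORK"])
```

Running it a second time is harmless because existing directories are left alone.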




Transfer files from AWS EC2 server to lonestar

Open a second terminal window:

# Log in to the EC2 server
$ ssh ec2-23-20-18-242.compute-1.amazonaws.com
jjv5@ec2-23-20-18-242.compute-1.amazonaws.com's password:

$ cd lectures/
$ ls
  week5
$ cd week5/
$ ls
  Thurs  Tues
$ cd Thurs/
$ ls

# use sftp to connect to lonestar
$ sftp tg801771@lonestar.tacc.teragrid.org
Connecting to lonestar.tacc.teragrid.org...
The authenticity of host 'lonestar.tacc.teragrid.org (129.114.53.21)' can't be established.
RSA key fingerprint is 5c:36:42:99:aa:2d:52:58:70:3a:20:c2:3a:33:e4:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'lonestar.tacc.teragrid.org,129.114.53.21' (RSA) to the list of known hosts.
Password:


# transfer files as needed

sftp> cd lectures
sftp> cd week5
sftp> cd Thurs
sftp> lls
4            example2.py  example4.py  myNumbers.txt  runBlast.sh  week5.fa.blast.asn
example1.py  example3.py  example5.py  parseBlast.py  week5.fa
sftp> put runBlast.sh
Uploading runBlast.sh to /home1/00921/tg801771/lectures/week5/Thurs/runBlast.sh
runBlast.sh                                                 100%  815     0.8KB/s   00:00
sftp>




Use scp to transfer whole directories

Note

Interactive ftp (sftp) clients often do not have a recursive option, which makes it difficult to transfer entire directories.

Alternatives include bundling all files into a single tar file or using a transfer method that does support recursion.

wget and scp support recursion (both take an -r option).

For moving large files, Globus Online is preferred: https://www.globusonline.org/
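
The single tar file approach mentioned in this note can be sketched with Python's standard tarfile module. This is an illustration only, not part of the lecture. The function names make_tarball and list_tarball are hypothetical, and gzip compression is an assumption.

```python
import os
import tarfile


def make_tarball(directory, archive_name):
    """Bundle a directory (recursively) into one gzip-compressed tar file."""
    with tarfile.open(archive_name, "w:gz") as tar:
        # arcname stores paths relative to the directory's own name
        top = os.path.basename(os.path.normpath(directory))
        tar.add(directory, arcname=top)
    return archive_name


def list_tarball(archive_name):
    """Return the names of all members stored in the archive."""
    with tarfile.open(archive_name, "r:gz") as tar:
        return tar.getnames()
```

The resulting archive is one file, so a single sftp put is enough to move it. It can be unpacked on the other side with tar -xzf.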

Secure copy (scp) can recursively copy whole directories:

ip138067:~ jjv5$ ssh tg801771@lonestar.tacc.teragrid.org
Password:
Last login: Tue Mar  6 03:51:47 2012 from ip138067.uvm.edu
------------------------------------------------------------------------------
          Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster
    Texas Advanced Computing Center, The University of Texas at Austin

------------------------ Disk quotas for user tg801771 ------------------------
| Disk         Usage (GB)     Limit    %Used   File Usage       Limit   %Used |
| /home1              1.1       1.1    98.11         1300     1001000    0.13 |
| /work              40.4     250.0    16.15        58255      500000   11.65 |
-------------------------------------------------------------------------------
login1$
login1$ cd $WORK
login1$ scp -r jjv5@ec2-23-20-18-242.compute-1.amazonaws.com:homework .
jjv5@ec2-23-20-18-242.compute-1.amazonaws.com's password:

Warning

scp overwrites existing files by default, without warning.

scp can be used to transfer files in either direction:

scp [[user@]host1:]file1 [...] [[user@]host2:]file2

From this host, directory mydir, to other host:
   scp -r mydir user@otherhost:/tmp

From remote host, directory mydir, to here:
   scp -r  user@otherhost:mydir   .
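
When transfers are driven from a Python program, the two directions shown above differ only in which argument carries the user@host: prefix. The sketch below only builds the argument list and does not run scp. The helper names scp_command and remote_path are hypothetical.

```python
def remote_path(user, host, path):
    """Format a remote location as user@host:path."""
    return "%s@%s:%s" % (user, host, path)


def scp_command(source, dest, recursive=False):
    """Return the scp argument list for copying source to dest."""
    cmd = ["scp"]
    if recursive:
        cmd.append("-r")  # copy whole directories
    cmd.extend([source, dest])
    return cmd


# From this host to the other host:
#   scp_command("mydir", remote_path("user", "otherhost", "/tmp"), recursive=True)
# From the remote host to here:
#   scp_command(remote_path("user", "otherhost", "mydir"), ".", recursive=True)
```

The returned list can be passed to subprocess.call when the copy should actually run.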




Create a job script

Create the script runHello.sh shown below:

#!/bin/bash

#$ -pe 1way 12 #  12 cores per node - must take them all
#$ -q development # Queue name
#$ -N  helloWorld
#$ -A TG-MCB120034
#$ -V                      # inherit submission env
#$ -j y                   # combine stderr & stdout into stdout
#$ -o $JOB_NAME.o$JOB_ID  # Name of the output file (eg. myMPI.oJobID)
#$ -l h_rt=00:05:00       # Run time (hh:mm:ss)

#$ -M jjv5.jjv5@gmail.com
#$ -m bea

echo "Hello, I am running"
date
hostname
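
Because every setting lives in the #$ lines, a job script like this one can also be generated from a few parameters, which makes it easy to change only the queue name or the run time. The following is a minimal sketch. The function make_job_script is hypothetical, and it uses only options that appear in the script above.

```python
def make_job_script(name, queue, runtime, account, commands, slots=12):
    """Return the text of a qsub job script with the given settings."""
    lines = [
        "#!/bin/bash",
        "",
        "#$ -pe 1way %d" % slots,   # cores requested
        "#$ -q %s" % queue,         # queue name
        "#$ -N %s" % name,          # job name
        "#$ -A %s" % account,       # allocation to charge
        "#$ -V",                    # inherit submission environment
        "#$ -j y",                  # combine stderr and stdout
        "#$ -l h_rt=%s" % runtime,  # run time (hh:mm:ss)
        "",
    ]
    lines.extend(commands)
    return "\n".join(lines) + "\n"


script = make_job_script("helloWorld", "development", "00:05:00",
                         "TG-MCB120034", ['echo "Hello, I am running"', "date", "hostname"])
print(script)
```

Writing the returned text to a file such as runHello.sh gives a script that can be handed to qsub.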

Submit the job to the development queue

The queue is specified in the job script itself:

qsub runHello.sh

Monitor the job with the qstat command:

login2$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 479531 0.00000 helloWorld tg801771     qw    02/28/2012 05:10:32                                   12
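
In the listing above the state column shows qw, meaning the job is queued and waiting. It changes to r once the job is running. If the output needs to be checked from a program, it can be split into fields as in this sketch. The function parse_qstat is hypothetical, and it assumes that job names contain no spaces and that the column order matches the listing above.

```python
def parse_qstat(text):
    """Return a list of dicts, one per job line found in qstat output."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Job lines start with a numeric job ID; header and dashes do not
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        jobs.append({
            "job_id": fields[0],
            "name": fields[2],
            "user": fields[3],
            "state": fields[4],
            "submitted": fields[5] + " " + fields[6],
        })
    return jobs
```

The text to parse could come from subprocess.check_output(["qstat"]) on the login node.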

Homework

Reading

Exercises

Turn In

  1. Transfer files from the AWS EC2 server to lonestar

    put everything in your $WORK directory
    bash shell scripts
    python programs
    16S blast database
  2. Create a shell script to run on lonestar using qsub that does the following

    cd to your $WORK directory
    list all files
    print the date
  • Be sure to use the proper qsub options and resource specifications in your script
  • Use the development queue to make sure the job runs properly
  • When you are sure it runs correctly, change the queue name to ‘normal’
  • Use qstat to monitor how long it takes before your job runs in the ‘normal’ queue
  • Leave the script in your $WORK/week_8/Tues homework folder on lonestar
  3. Create a job script on lonestar that runs a BLAST job
  • Use the week5.fa file (from the AWS EC2 server) as the input query file
  • Use the 16SMicrobial database as the database
  • Leave the script and output in your $WORK/week_8/Tues homework folder on lonestar



Thursday

Note

Learning Goals:

  • Create a complete qsub script for a BLAST job
  • Parse BLAST output with a python program using functions
  • Resubmit entire job with different parameters




Lecture



Create a job script

Add basic qsub parameters to an otherwise empty script:

#!/bin/bash

#$ -V                # inherit shell environment
#$ -l h_rt=00:05:00  # wall time limit
#$ -q development    # run in dev q
#$ -pe 1way 12
#$ -A TG-MCB120034

#$ -N Hello
#$ -cwd
#$ -j y

#$ -M jjv5.jjv5@gmail.com  # Mail address
#$ -m bea                  # send mail when job starts, stops or aborts


#module load blast

echo "Hello"



Warning

qsub recognizes #$ as meaningful. Make sure your commented lines do not begin with #$. For example: #$BLASTN -db ..... will cause qsub to interpret the line as an option string and thus fail. Put a space after the # to correct: # $BLASTN -db ...
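
A quick check for this mistake can be automated before submitting. The sketch below flags lines that begin with #$ but do not continue with an option. The function find_suspect_directives is hypothetical, and the rule that a genuine option starts with a dash is a heuristic assumed from the scripts in these notes.

```python
def find_suspect_directives(script_text):
    """Return (line_number, line) pairs that start with #$ but carry no option."""
    suspects = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if not line.startswith("#$"):
            continue
        rest = line[2:].strip()
        # A real directive continues with an option such as -V or -q
        if not rest.startswith("-"):
            suspects.append((number, line))
    return suspects


example = "#!/bin/bash\n#$ -V\n#$BLASTN -db 16SMicrobial\n# $BLASTN -db 16SMicrobial\n"
for number, line in find_suspect_directives(example):
    print("line %d looks like a broken qsub option: %s" % (number, line))
```

Only the third line of the example is reported. The fourth line is safe because of the space after the #.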

Add comments describing tasks and variables needed:

#!/bin/bash

#$ -V                # inherit shell environment
#$ -l h_rt=00:05:00  # wall time limit
#$ -q development    # run in dev q
#$ -pe 1way 12
#$ -A TG-MCB120034

#$ -N Hello
#$ -cwd
#$ -j y

#------------------------
#
# James Vincent
# March 8, 2012
#
# Run blast on week5.fa vs 16SMicrobial database
# Reformat output to include Query seq-id, subject seq-id, score and e-value
#
#------------------------


# BLAST programs and variables

# TACC lonestar uses module system to provide blast
module load blast

# Database
DB=$WORK/JSCBIO2710/blast/databases/16SMicrobial

# Query
QUERY=week5.fa
OUTFILE=$QUERY.blast.asn

# BLAST output format: 11 is ASN, 6 is table no header
OUTFMT=11

# BLAST programs loaded by module command
BLASTN=blastn
BLAST_FORMATTER=blast_formatter
BLASTDBCMD=blastdbcmd

# Run blast
# $BLASTN -db $DB -query $QUERY -outfmt $OUTFMT -out $OUTFILE

# Reformat ASN to custom hit table
# $BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid evalue bitscore" -out $OUTFILE.table

# Parse BLAST output with python program to get best hits
# myParser.py  $OUTFILE.table

echo "Hello"
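
The script derives every file name from QUERY, so rerunning it with a different query or database means changing one variable. The same derivation is sketched below in Python as an illustration. The function blast_commands is hypothetical, and it only builds the two command lines that are commented out in the script, without running them.

```python
def blast_commands(query, db, outfmt=11):
    """Return the blastn and blast_formatter argument lists for one query file."""
    outfile = query + ".blast.asn"   # same naming as OUTFILE in the script
    blastn = ["blastn", "-db", db, "-query", query,
              "-outfmt", str(outfmt), "-out", outfile]
    formatter = ["blast_formatter", "-archive", outfile,
                 "-outfmt", "6 qseqid sseqid evalue bitscore",
                 "-out", outfile + ".table"]
    return blastn, formatter


# The database name here is a placeholder; the script above uses a path under $WORK
blastn, formatter = blast_commands("week5.fa", "16SMicrobial")
print(" ".join(blastn))
print(" ".join(formatter))
```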



Create a python script to parse the BLAST table output:

#!/usr/bin/env python

"""

James Vincent
Mar 8 , 2012

parseBlast.py

Open a text file
loop over lines
split lines into fields
Sum numbers from certain field

"""

import sys

# Get file name
myInfileName = sys.argv[1]

infile = open(myInfileName)

mySum = 0.0
myCount = 0
# loop over each line in the file
for thisLine in infile.readlines():

  # BLAST input file has hit lines like this:
  # fmt "6 qseqid sseqid evalue bitscore"
  # 1        gi|219856848|ref|NR_024667.1|   0.0     2551
  myFields = thisLine.strip().split()

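  # Bit scores can be fractional (e.g. 52.8), so convert with float
  thisScore  = float(myFields[3])

  # Accumulate scores greater than 2600
  if thisScore > 2600:
     # accumulate scores
     mySum = mySum + thisScore
     # count number of scores matching
     myCount = myCount + 1

infile.close()

# Print sum, count and average (this form works in Python 2 and 3)
print("Sum is: %s" % mySum)
print("Count is: %s" % myCount)
# Avoid dividing by zero when no score passes the cutoff
if myCount > 0:
  print("Average is: %s" % (mySum/myCount))
else:
  print("Average is undefined: no scores above the cutoff")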



Create function to return score:

#!/usr/bin/env python

"""

James Vincent
Mar 8 , 2012

parseBlast.py

Open a text file
loop over lines
split lines into fields
Sum numbers from certain field

"""

import sys


def getScore(blastLine):
  """ parse blast output line and return score """

  # BLAST input file has hit lines like this:
  # fmt "6 qseqid sseqid evalue bitscore"
  # 1        gi|219856848|ref|NR_024667.1|   0.0     2551
  myFields = blastLine.strip().split()

  # Bit scores can be fractional (e.g. 52.8), so convert with float
  thisScore  = float(myFields[3])

  return thisScore



# Get file name
myInfileName = sys.argv[1]

infile = open(myInfileName)

mySum = 0.0
myCount = 0
# loop over each line in the file
for thisLine in infile.readlines():

  thisScore = getScore(thisLine)
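  # Accumulate scores greater than 2600
  if thisScore > 2600:
     # accumulate scores
     mySum = mySum + thisScore
     # count number of scores matching
     myCount = myCount + 1

infile.close()

# Print sum, count and average (this form works in Python 2 and 3)
print("Sum is: %s" % mySum)
print("Count is: %s" % myCount)
# Avoid dividing by zero when no score passes the cutoff
if myCount > 0:
  print("Average is: %s" % (mySum/myCount))
else:
  print("Average is undefined: no scores above the cutoff")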
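
The job script comments mention using the parser to get best hits. Using functions as above, one way to do that is to keep the highest-scoring subject for each query in a dictionary. This is a sketch under assumptions, not the course solution. The function best_hits is hypothetical, and the second and third sample lines are made-up records in the same "6 qseqid sseqid evalue bitscore" format.

```python
def best_hits(lines):
    """Return {query id: (subject id, bit score)}, keeping the top score per query."""
    best = {}
    for line in lines:
        fields = line.strip().split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        query, subject, score = fields[0], fields[1], float(fields[3])
        # Replace the stored hit only when this one scores higher
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


sample = [
    "1\tgi|219856848|ref|NR_024667.1|\t0.0\t2551",
    "1\tgi|111|ref|NR_000001.1|\t0.0\t2600",      # hypothetical record
    "2\tgi|222|ref|NR_000002.1|\t1e-50\t200",     # hypothetical record
]
for query, (subject, score) in sorted(best_hits(sample).items()):
    print("%s %s %s" % (query, subject, score))
```

An open file can be passed in place of the sample list, since iterating over a file yields its lines.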

Homework

Reading

Exercises

  • Make sure you can write the shell scripts and python programs that we did in class
  • You should be able to write complete python programs from scratch
  • You should be able to write complete qsub scripts that work with some copying of qsub parameters

Turn In

  • Modify the parseBlast.py python program (last program shown above)
  • Add a function to return just the GI number from each line of BLAST output