Note
Learning Goals:
Texas Advanced Computing Center: TACC
lonestar.tacc.teragrid.org
Log in to the TACC lonestar cluster lonestar.tacc.teragrid.org:
You should have received login details from XSEDE for
your new account.
jjv5$ ssh tg801771@lonestar.tacc.teragrid.org
Make sure we are using the bash shell:
login1$ echo $SHELL
/bin/bash
# If needed we can change the defualt shell to bash:
login1$ chsh -l
/bin/sh
/bin/bash
/sbin/nologin
/bin/tcsh
/bin/csh
/bin/ksh
/bin/zsh
Recreate directory structure
Make new directories for quiz,homework,projects:
login2$ mkdir quiz homework projects
login2$ ls
homework projects quiz
Create any other directories as needed
Transfer files from AWS EC2 server to lonestar
Open a second terminal window:
# Log in to the EC2 server
$ ssh ec2-23-20-18-242.compute-1.amazonaws.com
jjv5@ec2-23-20-18-242.compute-1.amazonaws.com's password:
$ cd lectures/
$ ls
week5
$ cd week5/
$ ls
Thurs Tues
$ cd Thurs/
$ ls
# use sftp to connect to lonestar
$ sftp tg801771@lonestar.tacc.teragrid.org
Connecting to lonestar.tacc.teragrid.org...
The authenticity of host 'lonestar.tacc.teragrid.org (129.114.53.21)' can't be established.
RSA key fingerprint is 5c:36:42:99:aa:2d:52:58:70:3a:20:c2:3a:33:e4:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'lonestar.tacc.teragrid.org,129.114.53.21' (RSA) to the list of known hosts.
Password:
# transfer files as needed
sftp> cd lectures
sftp> cd week5
sftp> cd Thurs
sftp> lls
4 example2.py example4.py myNumbers.txt runBlast.sh week5.fa.blast.asn
example1.py example3.py example5.py parseBlast.py week5.fa
sftp> put runBlast.sh
Uploading runBlast.sh to /home1/00921/tg801771/lectures/week5/Thurs/runBlast.sh
runBlast.sh 100% 815 0.8KB/s 00:00
sftp>
Create a job script
Create the script runHello.sh shown below:
#!/bin/bash
#$ -pe 1way 12 # 12 cores per node - must take them all
#$ -q development # Queue name
#$ -N helloWorld
#$ -A TG-MCB120034
#$ -V # inherit submission env
#$ -j y # combine stderr & stdout into stdout
#$ -o $JOB_NAME.o$JOB_ID # Name of the output file (eg. myMPI.oJobID)
#$ -l h_rt=00:05:00 # Run time (hh:mm:ss)
#$ -M jjv5.jjv5@gmail.com
#$ -m bea
echo "Hello, I am running"
date
hostname
Submit the job to the development queue
The queue is specifiec in the job script itself:
qsub runHello.sh
Monitor the job with th qstat command:
login2$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
479531 0.00000 helloWorld tg801771 qw 02/28/2012 05:10:32 12
Reading
Exercises
Turn In
create a shell script to run on lonestar using qsub that does the following:
cd to your home directory
list all files
print the date
Note
Learning Goals:
Run BLAST vs 16S database:
#!/bin/bash
# James Vincent
# March 1, 2012
#
#
# Run blast on week5.fa vs 16SMicrobial database
# Reformat output to include Query seq-id, subject seq-id, score and e-value
#
# Database
DB=/home/jjv5/blast/databases/16SMicrobial
# Query
QUERY=week5.fa
OUTFILE=$QUERY.blast.asn
# BLAST output format: 11 is ASN, 6 is table no header
OUTFMT=11
# BLAST programs
BLASTN=/mnt/blast/ncbi-blast-2.2.25+/bin/blastn
BLAST_FORMATTER=/mnt/blast/ncbi-blast-2.2.25+/bin/blast_formatter
BLASTDBCMD=/mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd
# Run blast
#$BLASTN -db $DB -query $QUERY -outfmt $OUTFMT -out $OUTFILE
# Reformat ASN to hit custom hit table
#$BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid evalue bitscore" -out $OUTFILE.table
#$BLAST_FORMATTER -archive $OUTFILE -outfmt "6 sgi" -out $OUTFILE.table.sgi
# default format values: 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'
#
for i in `cat week5.fa.blast.asn.table.sgi`
do
$BLASTDBCMD -db $DB -entry $i -outfmt "%g %T "
done
Parse BLAST output with python program:
#!/usr/bin/env python
"""
James Vincent
Mar 1 , 2012
parseBlast.py
Open a text file
loop over lines
split lines into fields
Sum numbers from certain field
"""
import sys
# Get file name
myInfileName = sys.argv[1]
infile = open(myInfileName)
mySum = 0.0
myCount = 0
# loop over each line in the file
for thisLine in infile.readlines():
# BLAST input file has hit lines like this:
# 1 gi|219856848|ref|NR_024667.1| 0.0 2551
myFields = thisLine.strip().split()
thisScore = int(myFields[3])
# Accumulate scores greater than 3
if thisScore > 2600:
# accumulate scores
mySum = mySum + thisScore
# count number of scores matching
myCount = myCount + 1
# Print sum, count and average
print "Sum is: ",mySum
print "Count is: ",myCount
print "Average is: ",mySum/myCount
Use blastdbcmd to look up taxonid:
#!/bin/bash
# James Vincent
# March 1, 2012
#
#
# Get taxonimic ID of BLAST hit from GI number
#
# Database
DB=/home/jjv5/blast/databases/16SMicrobial
# Query
QUERY=week5.fa
OUTFILE=$QUERY.blast.asn.table
# BLAST programs
BLASTDBCMD=/mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd
# Extract taxon ID
$BLASTDBCMD -db $DB -entry 310975189 -outfmt "%g %T "
Look up taxon ids from all blast hits:
#!/bin/bash
# James Vincent
# March 1, 2012
#
# Loop over all lines in BLAST output with hit GI
# Get taxonimic ID of BLAST hit from GI number
#
# Database
DB=/home/jjv5/blast/databases/16SMicrobial
# Query
QUERY=week5.fa
OUTFILE=$QUERY.blast.asn.table
# BLAST programs
BLASTDBCMD=/mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd
# Extract taxon ID
for i in `cat $OUTFILE.sgi`
do
$BLASTDBCMD -db $DB -entry $i -outfmt "%g %T "
done
Look up taxon id from python program:
Reading
Exercises
Turn In
Complete the exercises in http://bash.cyberciti.biz/guide/For_loop
Put each example in a separate shell script and leav them in your homework directory
Write a python program that parses BLAST output in table format:
Use the input file week5.fa from class to run BLAST against 16S database Reformat the BLAST output into table format that includes the bitscore In your python program select only BLAST output hits that have a bitscore greater than 2000