JSC-BIO-2710

Data Intensive Computing for Applied Bioinformatics

Week Four 07-Feb-12

Tuesday

Note

Learning Goals:

  • structure of a python program
  • python comments
  • running python programs from a shell script
  • BLAST e-values

Lecture

Quiz 3

Introduction To Python

  • format of python script:

    #!/usr/bin/env python
    
    print "Hello World"
    
  • comments, header:

    #!/usr/bin/env python
    
    """
    
       Author: James Vincent
       Date:   07-Feb-12
    
       This program prints Hello World.
    
    """
    
    # print today's message
    print "Hello World"
    
  • variable naming: myVar or my_var:

    #!/usr/bin/env python
    
    """
    
       Author: James Vincent
       Date:   07-Feb-12
    
       This program prints Hello World.
    
    """
    
    thisMessage = "Hello World"
    print thisMessage
    
    that_message = "Hello World"
    print that_message
    
  • indentation is meaningful:

    #!/usr/bin/env python
    
    """
    
       Author: James Vincent
       Date:   07-Feb-12
    
       This program prints Hello World.
    
    """
    
    thisMessage = "Hello World"
       print thisMessage  # this will fail with IndentationError: unexpected indent
    
    that_message = "Hello World"
    print that_message
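
Why the indented line fails can be checked without running a whole program. The sketch below is an illustration only: the two snippets and the helper name compiles are made up for this example, and it relies on the built-in compile() function to ask Python whether it accepts a piece of source text. It uses print(...) with parentheses so that it runs under Python 3 as well.

```python
# Indentation marks blocks in Python, so a stray indent is a syntax error.

# Indentation after "if ...:" is required and accepted.
good_source = 'message = "Hello World"\nif message:\n    shown = message\n'

# Indentation with no block to belong to is rejected.
bad_source = 'message = "Hello World"\n   shown = message\n'


def compiles(source):
    """Return True if Python accepts the source text, False otherwise."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False


print(compiles(good_source))
print(compiles(bad_source))
```

The first snippet is accepted because the indented line belongs to the if block; the second is rejected for the same reason as the program above.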

Homework review

  • identify unknown sequence: how should we interpret two hits at 100%?

BLAST

  • meaning of e-value
  • what if we make up our own sequence?
  • how does changing e-value affect results?
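
As a rough guide to the questions above, the e-value is the number of hits with a score at least this good that would be expected by chance alone. A common textbook approximation is E = m * n * 2^(-S), where m is the query length, n is the total database length and S is the bit score. The sketch below assumes that simplified formula (BLAST itself uses adjusted, "effective" lengths, so its reported values will differ) and uses made-up hit names and numbers purely for illustration.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate e-value: chance hits expected with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def keep_hits(hits, max_evalue):
    """Keep only (name, evalue) pairs whose e-value is at or below the cutoff."""
    return [(name, evalue) for name, evalue in hits if evalue <= max_evalue]


# Made-up sizes: a 1,000-base query against a 10-million-base database.
for score in (20, 40, 60):
    print("bit score %d -> e-value about %g" % (score, expected_hits(score, 1000, 10 ** 7)))

# Hypothetical hits, not real BLAST results.
hypothetical_hits = [("hitA", 1e-50), ("hitB", 1e-4), ("hitC", 2.0)]

# A loose cutoff keeps everything; a strict one keeps only the strongest hit.
print(keep_hits(hypothetical_hits, 10))
print(keep_hits(hypothetical_hits, 0.000001))
```

Each extra bit of score halves the e-value, and a made-up sequence will usually produce only hits with large e-values. Lowering the cutoff never adds hits; it only removes the weaker ones.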

Homework Directories

  • create homework, quiz, project directories in home directory

  • all homework goes in its own subdirectory of homework:

    homework/week4/Tues
    homework/week4/Thurs
    homework/week5/Tues
    homework/week5/Thurs
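
The same layout can also be created from Python, which is handy once there are many weeks. This is a minimal sketch: the function name make_homework_dirs and its parameters are made up for this example, and the demonstration writes into a temporary scratch directory instead of your real home directory.

```python
import os
import tempfile


def make_homework_dirs(home, weeks=("week4", "week5"), days=("Tues", "Thurs")):
    """Create quiz, project and homework/<week>/<day> directories under home."""
    wanted = [os.path.join(home, "quiz"), os.path.join(home, "project")]
    for week in weeks:
        for day in days:
            wanted.append(os.path.join(home, "homework", week, day))
    for path in wanted:
        # makedirs also creates missing parents such as homework/week4
        if not os.path.isdir(path):
            os.makedirs(path)
    return wanted


# Scratch directory standing in for $HOME in this demonstration.
demo_home = tempfile.mkdtemp()
for path in make_homework_dirs(demo_home):
    print(path)
```

To use it for real, pass os.path.expanduser("~") instead of the scratch directory.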
    

Homework

Reading

Exercises

  • complete Exercises 5-10 and 13 in the python reading above

Turn In

  • make sure you have a directory called homework in your home directory
  • make subdirectories under homework for each week and day
  • turn in completed exercises from the reading above
  • include a descriptive header (in comments) in every python program you write
  • write a shell script to run BLAST on the sequence from Homework5 (Thursday, last week) against the 16SMicrobial database (just like the last homework)
  • read the BLAST help (-h and -help) to find output format options
  • make the output of the BLAST job in hit table format
  • find the option for setting e-value
  • write a second shell script that runs BLAST again, but with the e-value set to 0.000001


Thursday

Note

Learning Goals:

Lecture

Quiz 4

BLASTN revisited

  • blast programs are in /mnt/blast/ncbi-blast-2.2.25+/bin on the AWS server

  • download 16S database: ftp.ncbi.nih.gov/blast/db/16SMicrobial.tar.gz:

    # create $HOME/blast/databases if you don't already have it
    mkdir -p ~/blast/databases
    cd ~/blast/databases
    
    ftp ftp.ncbi.nih.gov
    cd blast/db
    get 16SMicrobial.tar.gz
  • uncompress database:

    tar -zxf 16SMicrobial.tar.gz
  • use blastdbcmd -db 16SMicrobial -entry all to get all fasta sequences:

    /mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd -db 16SMicrobial -entry all  > 16SMicrobial.fa
  • collect three sequences from the 16SMicrobial.fa file:

    head -300 16SMicrobial.fa > testThree.fa
    # then edit testThree.fa so it contains three complete sequences
  • execute blastn -help to get full help (blastn -h prints only a short usage summary), find the outfmt option:

    /mnt/blast/ncbi-blast-2.2.25+/bin/blastn -help | less
    # type /outfmt within less to search for the word outfmt
  • execute blastn:

    /mnt/blast/ncbi-blast-2.2.25+/bin/blastn -db 16SMicrobial -query testThree.fa
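
The step of collecting three complete sequences can also be done without hand editing. The sketch below is one possible approach, not part of the lecture: the function name first_records is made up, and the demonstration uses invented FASTA lines in place of the real 16SMicrobial.fa file. It relies only on the fact that each FASTA record starts with a line beginning with ">".

```python
def first_records(lines, count):
    """Return the lines of the first `count` complete FASTA records."""
    kept = []
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > count:
                break  # the next record has started, so stop here
        if seen >= 1:
            kept.append(line)
    return kept


# Invented records for illustration; real sequences are much longer.
demo = [
    ">seq1 made-up\n", "ACGT\n", "GGCC\n",
    ">seq2 made-up\n", "TTAA\n",
    ">seq3 made-up\n", "CCGG\n",
    ">seq4 made-up\n", "AAAA\n",
]
three = first_records(demo, 3)
print("".join(three))

# With the real file, the same function could be used like this:
#   with open("16SMicrobial.fa") as source:
#       with open("testThree.fa", "w") as out:
#           out.writelines(first_records(source, 3))
```

Stopping at the fourth header guarantees that the third sequence is complete, which head -300 alone cannot promise.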

BLAST ASN output format

  • execute blastn again but this time use BLAST archive ASN format -outfmt 11 and an output file name:

    /mnt/blast/ncbi-blast-2.2.25+/bin/blastn -db 16SMicrobial -query testThree.fa  -outfmt 11  -out testThree.fa.blast.asn

Reformat BLAST ASN output format

  • Use the testThree.fa.blast.asn output file to generate a different output format:

    /mnt/blast/ncbi-blast-2.2.25+/bin/blast_formatter -archive testThree.fa.blast.asn -outfmt 7

Put commands in a shell script

  • Use a variable for blast programs:

    #!/bin/bash
    
    BLASTN=/mnt/blast/ncbi-blast-2.2.25+/bin/blastn
    BLASTFORMATTER=/mnt/blast/ncbi-blast-2.2.25+/bin/blast_formatter
    DB=$HOME/blast/databases/16SMicrobial
    QUERY=testThree.fa
    OUTFILE=$QUERY.blast.asn
    
    # /mnt/blast/ncbi-blast-2.2.25+/bin/blastn -db 16SMicrobial -query testThree.fa  -outfmt 11  -out testThree.fa.blast.asn
    
    echo "Running BLASTN"
    echo "query: $QUERY"
    echo "db: $DB"
    "$BLASTN" -db "$DB" -query "$QUERY" -outfmt 11 -out "$OUTFILE"
    echo "Finished BLASTN"
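
For comparison, the same idea of holding program paths and file names in variables can be written in Python. This is only a sketch: the function names blastn_command and formatter_command are made up for this example, the paths are the ones used in the script above, and BLAST is only run if the program and query file are actually present on the machine.

```python
import os
import subprocess

BLAST_BIN = "/mnt/blast/ncbi-blast-2.2.25+/bin"


def blastn_command(query, db, outfile, outfmt=11):
    """Build the blastn argument list used in the shell script above."""
    return [BLAST_BIN + "/blastn", "-db", db, "-query", query,
            "-outfmt", str(outfmt), "-out", outfile]


def formatter_command(archive, outfmt=7):
    """Build the blast_formatter argument list for an ASN archive file."""
    return [BLAST_BIN + "/blast_formatter", "-archive", archive,
            "-outfmt", str(outfmt)]


query = "testThree.fa"
db = os.path.join(os.path.expanduser("~"), "blast", "databases", "16SMicrobial")
outfile = query + ".blast.asn"

command = blastn_command(query, db, outfile)
print("Running BLASTN")
print(" ".join(command))

# Only try to run BLAST where the program and query file really exist.
if os.path.exists(command[0]) and os.path.exists(query):
    subprocess.check_call(command)
    subprocess.check_call(formatter_command(outfile))
    print("Finished BLASTN")
```

Passing the arguments as a list avoids quoting problems when a path contains spaces.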

Parsing BLAST output with python
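
A minimal sketch of what such parsing might look like, assuming the hit table produced by -outfmt 7 with its default columns: tab-separated fields for query id, subject id, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value and bit score, with comment lines starting with "#". The function name parse_hit_table and the sample lines are invented for illustration and are not real BLAST results.

```python
FIELDS = ["query", "subject", "identity", "length", "mismatches", "gap_opens",
          "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score"]


def parse_hit_table(lines):
    """Turn hit table lines into a list of dictionaries, one per hit."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, line.split("\t")))
        # Convert the numeric columns used below from text to numbers.
        for name in ("identity", "evalue", "bit_score"):
            hit[name] = float(hit[name])
        hits.append(hit)
    return hits


# Invented sample lines in the assumed format.
sample = [
    "# BLASTN 2.2.25+",
    "# Query: query1 (made-up example)",
    "query1\tsubjectA\t100.00\t1200\t0\t0\t1\t1200\t1\t1200\t0.0\t2217",
    "query1\tsubjectB\t97.50\t1200\t30\t0\t1\t1200\t5\t1204\t1e-120\t1800",
]

for hit in parse_hit_table(sample):
    print("%s %s %.2f %g" % (hit["query"], hit["subject"], hit["identity"], hit["evalue"]))
```

With real output, the lines would come from the file written by blast_formatter, for example by passing an open file object to parse_hit_table.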

Homework

Reading

Exercises

  • complete Exercises 15,16,17 in the python reading above

Turn In

  • turn in python exercises 15,16,17

  • put them in the proper homework directory in your home on the AWS server (for week 4, Thursday)

  • write a shell script called week4_Thurs.sh:

    use variables to hold the name and full path of the blastn program, query file and database
    create a single query file containing the two sequences below
    run blastn on the query file
    use the 16SMicrobial database
    make the output ASN format
    reformat the output using the blast_formatter command to give hit table format

Query sequences:

>gi|313761029|gb|GU197655.1| Anabaena bergii CHAB1385 16S ribosomal RNA gene, partial sequence
GGGTGAGTAACGCGTAAGAATCTACCTTCAGGTTGGGGACAACCACTGGAAACGGTGGCTAATACCGAAT
GTGCCGAGAGGTGAAAGGCTTGCTGCCTGAAGAAGAGCTTGCGTCTGATTAGCTAGTTGGTGGGGTAAGA
GCCTACCAAGGCGACGATCAGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCC
AGACTCCTACGGGAGGCAGCAGTGGGGAATTTTCCGCAATGGGCGAAAGCCTGACGGAGCAATACCGCGT
GAGGGAGGAAGGCTCTTGGGTTGTAAACCTCTTTTCTCAGGGAAGAAGACAATGACGGTACCTGAGGAAT
AAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGG
GCGTAAAGGGTCCGCAGGTGGTAGTGTAAGTCTGCTGTTAAAGAGTCACGCTCAACGTGATCAAAGCAGT
GGAAACTACACAACTAGAGTACGGTAGGGGCAGAAGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATAT
CAGGAAGAACACCGGTGGCGAAAGCGTTCTGCTAGACCTGTACTGACACTGAGGGACGAAAGCTAGGGGA
GCGAATGGGATTAGATACCCCAGTAGTCCTAGCCGTAAACGATGGATACTAGGTGTGGCTTGTATCGACC
CGAGCCGTACCGTAGCTAACGCGTTAAGTATCCCGCCTGGGGAGTACGCACGCAAGTGTGAAACTCAAAG
GAATTGACGGGGGCCCGCACAAGCGGTGGAGTATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCA
AGGCTTGACATGTCGCGAATCTCGATGAAAGTTGAGAGTGCCTTCGGGAACGCGAACACAGGTGGTGCAT
GGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTTTTTAGTT
GCCAGCATTAAGTTGGGCACTCTAGAGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAA
GTCAGCATGCCCCTTACGCCTTGGGCTACACACGTACTACAATGCTCCGGACAAAGGGCAGCTACACAGC
GATGTGATGCAAATCTCATAAACCGGAGCTCAGTTCAGATCGAAGGCTGCAACTCGCCTTCGTGAAGGAG
GAATCGCTAGTAATTGCAGGTCAGCATACTGCAGTGAATTCGTTCCCGGGCCTTGTACACACCGCCCGTC
ACACCATGGAAGTTGGTCACGCCCGAAGTCA

>gi|374092814|gb|JQ237773.1| Anabaena tenericaulis 08-10 16S ribosomal RNA gene, partial sequence
GACGGGTGAGTAACGCGTAAGAATCTACCTTCAGGTTGGGGACAACCACTGGAAACGGTGGCTAATACCC
AATGTGCCGAGAGGTGAAAGGCTTGCTGCCTGAAGAAGAGCTTGCGTCTGATTAGCTAGTTGGTGGGGTA
AGAGCCTACCAAGGCGACGATCAGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGG
CCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTCCGCAATGGGCGAAAGCCTGACGGAGCAATACCG
CGTGAGGGAGGAAGGCTCTTGGGTTGTAAACCTCTTTTCTCAGGGAAGAACAAAATGACGGTACCTGAGG
AATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGAT
TGGGCGTAAAGGGTCCGCAGGTGGCATTGTAAGTCTGCTGTTAAAGAGTTTGGCTCAACCAAATAAAAGC
AGTGGAAACTACAAAGCTAGAGTGTGGTCGGGGCAGAGGGAATTCCTGGTGTAGCGGTGAAATGCGTAGA
TATCAGGAAGAACACCGGTGGCGAAGGCGCTCTGCTAGGCCAAGACTGACACTGAGGGACGAAAGCTAGG
GGAGCGAATGGGATTAGATACCCCAGTAGTCCTAGCCGTAAACGATGGATACTAGGCGTAGCTCGTATCG
ACCCGAGCTGTGCCGTAGCTAACGCGTTAAGTATCCCGCCTGGGGAGTACGCAGGCAACTGTGAAACTCA
AAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGTATGTGGTTTAATTCGATGCAACGCGAAGAACCTTA
CCAAGGCTTGACATGTCACGAATTCCGTTGAAAGATGGAAGTGCCTTCGGGAGCGTGAACACAGGTGGTG
CATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTTTTTA
GTTGCCAGCATTAAGTTGGGCACTCTAGAGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGT
CAAGTCAGCATGCCCCTTACGTCTTGGGCTACACACGTACTACAATGCTACGGACAAAGGGCAGCTACAC
AGCGATGTGATGCGAATCTCATAAACCGTAGCTCAGTTCAGATCGAAGGCTGCAACTCGCCTTCGTGAAG
GAGGAATCGCTAGTAATTGCAGGTCAGCATACTGCAGTGAATTCGTTCCCGGGCCTTGTACACACCGCCC
GTCACACCATGGAAGTTGGTCACGCCCGAAGTCGTTACCCCAACCGCAAGGAGGGGGATGCCTAAGGTAG
GACTGATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTACCGGAAGGTGTGGCTGGATCACCTCCTTTT