JSC-BIO-2710

Data Intensive Computing for Applied Bioinformatics

Week Three 31-Jan-12

Tuesday

Note

Learning Goals:

  • class project overview - metagenomics
  • redirect in the shell
  • sort command
  • introduction to blast

Lecture

Homework

Reading

Exercises

Complete the exercises in the UNIX tutorials (6,8) assigned above.

Turn In

Create a file called homework-4a.txt within your homework directory on the AWS server. Follow these links and use BLAST from a web page to determine what these sequences are:

Write a very brief description (in the file homework-4a.txt) of what you did to identify the sequences given above. Do not copy and paste the output of BLAST.



Thursday

Note

Learning Goals:

Lecture

What will be covered in class:

redirection

  • ls -l and redirect output to a file
  • append date to this file
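The two redirection steps above can be sketched as follows. The file name listing.txt is an assumption chosen for illustration; any file name works.

```shell
# Write a long listing of the current directory to a file.
# '>' creates the file, or overwrites it if it already exists.
ls -l > listing.txt

# Append the current date and time to the end of the same file.
# '>>' adds to the file instead of overwriting it.
date >> listing.txt

# Show the result: the listing followed by the date on the last line.
cat listing.txt
```

Running the ls line a second time would replace the file's contents, while running the date line again would add another line at the end.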

sort

  • sort files in /dev with and without -r
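A minimal sketch of the /dev exercise, combining sort with redirection. The output file names dev_sorted.txt and dev_reversed.txt are assumptions for illustration, and the exact contents of /dev differ from machine to machine.

```shell
# List the entries in /dev, sort them in ascending order,
# and save the sorted list to a file.
ls /dev | sort > dev_sorted.txt

# The same listing sorted in reverse (descending) order.
ls /dev | sort -r > dev_reversed.txt

# Compare the first few lines of each file.
head -n 5 dev_sorted.txt
head -n 5 dev_reversed.txt
```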

blastn

  • download blast executable: ftp.ncbi.nih.gov//blast/executables/blast+/2.2.25/ncbi-blast-2.2.25+-ia32-linux.tar.gz

  • download 16S database: ftp.ncbi.nih.gov/blast/db/16SMicrobial.tar.gz

  • uncompress database, executables

  • use blastdbcmd -db 16SMicrobial -entry all to get all fasta sequences:

    ../ncbi-blast-2.2.25+/bin/blastdbcmd -db 16SMicrobial -entry all  > 16SMicrobial.fa
  • use head -30 or so to get the first fasta sequence to use as a test, and redirect it to a file named 1.fa:

    head -30 16SMicrobial.fa > 1.fa

    Then edit 1.fa so it contains only one complete sequence.
  • execute blastn -h to get help

  • execute blastn:

    ../ncbi-blast-2.2.25+/bin/blastn -db 16SMicrobial -query 1.fa
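The head step above needs hand editing because a fixed line count rarely ends exactly at the end of a record. As an alternative that was not covered in class, awk can stop at the second header line. This is only a sketch: the files demo.fa and first.fa and the sequences in them are made up for illustration and stand in for 16SMicrobial.fa and 1.fa.

```shell
# Build a tiny two-record FASTA file to stand in for 16SMicrobial.fa.
# The names and sequences are made up for this illustration.
cat > demo.fa <<'EOF'
>seq1 Example organism A
ACGTACGTAC
GGTTAACC
>seq2 Example organism B
TTGACCAGTA
EOF

# Print lines until the second header line is reached, then stop.
# The result is the first complete record, with no hand editing.
awk '/^>/ && ++n > 1 { exit } { print }' demo.fa > first.fa

cat first.fa
```

The counter n goes up by one at every header line, so awk exits as soon as it sees the start of the second record.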

grep patterns

  • grep can search for patterns:

    grep '^>' sequence.fa
    means find the pattern '^>' in the file sequence.fa
    The special character ^ means match only at the beginning of a line
    grep will look for the string > only at the start of a line.
    This is exactly what FASTA header lines start with.
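The effect of the ^ anchor can be tried on a small file. In this sketch the file example.fa and its contents are made up for illustration; the -c option makes grep print the number of matching lines instead of the lines themselves.

```shell
# A small made-up FASTA file; the second header also has a '>' mid-line.
cat > example.fa <<'EOF'
>seq1 Example organism A
ACGTACGTAC
>seq2 Example organism B >variant
TTGACCAGTA
EOF

# Print only the lines that start with '>' (the header lines).
grep '^>' example.fa

# Count the header lines, which equals the number of sequences.
grep -c '^>' example.fa
```

Because the pattern is anchored with ^, the extra > in the middle of the second header does not change what counts as a header line.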

sort

Extract fasta definition lines from 16S database:

grep '^>' 16SMicrobial.fa > 16S_defLines.txt

Sort the definition lines:

sort 16S_defLines.txt

Sort the definition lines and pipe the output to less:

sort 16S_defLines.txt | less

Sort in reverse order:

sort -r 16S_defLines.txt | less

Sort by second field - the beginning of the description:

sort -k2 16S_defLines.txt | less

Sort by second field in reverse:

sort -k2r 16S_defLines.txt | less
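The difference between sorting whole lines and sorting by the second field is easier to see on a few lines. In this sketch the file deflines_demo.txt and its IDs are made up for illustration; real definition lines from the 16S database are longer. Setting LC_ALL=C is an added assumption that makes sort compare plain bytes, so the order is the same on every system.

```shell
# Three made-up definition lines: an ID, a space, then the description.
cat > deflines_demo.txt <<'EOF'
>id3 Bacillus subtilis
>id1 Escherichia coli
>id2 Clostridium difficile
EOF

# Use byte-wise comparison so the result does not depend on locale.
export LC_ALL=C

# Whole-line sort: ordered by the ID at the start of each line.
sort deflines_demo.txt

# Sort on field 2 onward: ordered by the description.
sort -k2 deflines_demo.txt

# The same key, in reverse order.
sort -k2r deflines_demo.txt
```

Fields are separated by whitespace by default, so -k2 starts the sort key at the first word after the ID.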

Homework

Reading

Exercises

  • complete Exercises 1-4 in the Python reading above
  • use the AWS ec2 server to do this work

Turn In

  • email me the user ID you were assigned by XSEDE. For example, mine is: jvincent
  • leave the Python files ex1.py through ex4.py in your homework directory on the AWS server
  • write a shell script to run blastn against the 16S microbial database for the unknown sequence below
  • copy the sequence into a text file first ( homework5.fa )
  • in the script, use redirect to send all blast output to a text file
  • use variables for the database name, query file and blast program
  • use full paths as the values of the variables
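The general pattern of a script that keeps its program and file names in variables can be sketched with a different command. This is not the homework solution: sort stands in for the program, and the variable names PROGRAM, INPUT and OUTPUT and the pattern_*.txt file names are assumptions chosen for illustration.

```shell
#!/bin/sh
# Stand-in values: in a real script these would be full paths
# typed out for your own system.
PROGRAM="$(command -v sort)"      # full path to the program to run
INPUT="$PWD/pattern_input.txt"    # full path to the input file
OUTPUT="$PWD/pattern_output.txt"  # full path to the output file

# Create a small input file so the sketch runs on its own.
printf 'b\na\nc\n' > "$INPUT"

# Run the program. '>' captures standard output in the output file,
# and '2>&1' sends error messages to the same file.
"$PROGRAM" "$INPUT" > "$OUTPUT" 2>&1

cat "$OUTPUT"
```

Keeping the paths in variables at the top of the script means a change of database or query file is made in one place only.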

Unknown sequence:

>Homework5
AGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAACGGAGACAATTGGTTCGCTGA
TTGTCTTAGTGGCGGACGGGTGAGTAACGCGTGAGCAATCTGCCCTTCGGAGGGGGACAACAGCTGGAAACGGCTGCTAA
TACCGCATAATGTATATTCAAGGCATCTTGGATATACCAAAGATTTATCGCCGAAGGATGAGCTCGCGTCTGATTAGCTA
GTTGGTGAGGTAAAGGCTCACCAAGGCTGCGATCAGTAGCCGGACTGAGAGGTTGAACGGCCACATTGGAACTGAGATAC
GGGCCAGACTCCTACGGGAGGGAGCAGTGGGGAATTTTGGNCAATGGGGGAAAGCCNTACCCAGCAACGCCGCGTGAAGG
AAGAAGGCCTTCGGGTTGTAAACTTCTTTGACCAGGGACGAAACAAATGACGGTACCTGGAAAACAAGCCACGGCTAACT
ACGTGCCAGCAGCCGCGGTATTACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGCGCGTAGGCGGGAGT
ACAAGTCAGATGTGAAATCTGGGGGCTTAACCCTCAAACTGCATTTGAAACTGTATTTCTTGAGTATCGGAGAGGCAGGC
GGAATTCCTAGTGTAGCGGTGAAATGCGTTGATATTAGGAGGAACACCAGTGGCGAAGGCGGCCTGCTGGACGACAACTG
ACTCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGAATACTAGGTG
TGGGGGGACTGACCCCCTCCGTGCCGGAGTTAACACAATAAGTATTCCACCTGGGGAGTACGNCCGCAAGGTTGAAACTC
AAAGGAATTGACGGGGGCCCGCACAAGCAGTGGATTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGGCTT
GACATCGTACTAACGAAGCAGAGATGCATTAGGTGCCCTTCCGGGGAAAGTATAGACAGGTGGTGCATGGTTGTCGTCAG
CTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTNATTTGCTACNCGAGANCACTCTAGCG
AGGCTGCCGATGACAAACCGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGTCCTGGGCTACACACGTAA
TACAATGTCTCTCACAGAGGGAAGCAAGACCGCGAGGTGGAGCAAATCCCTAAAATGCGTCTCAGTTCAGATTGCAGGCT
GCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTAC
ACACCGCCCGTCACACCATGAGAGCCGGGAACACCCGAAGTCCGTAGTCTAACCGCAAGGGGGACGCGGCCGAAGGTGGG
TTTGGTAATTGGGGTGAAGTCGTAACAAGGTAGCCGTATCGGAAGGTGCGGCTGGATCACCTCCTT