Note
Learning Goals:
Homework review
Quiz 2 Do-Over
Class Project
BLAST
redirection
sort
Reading
http://www.ncbi.nlm.nih.gov/books/NBK21097/#top
Read up to Appendix 1: FASTA header
Exercises
Complete the exercises in the UNIX tutorials (6,8) assigned above.
Turn In
Create a file called homework-4a.txt within your homework directory on the AWS server. Follow the links below and use the web version of BLAST to determine what these sequences are:
- http://www.ncbi.nlm.nih.gov/nuccore/371943082?report=fasta
- http://www.ncbi.nlm.nih.gov/nuccore/372220095?report=fasta
- http://www.ncbi.nlm.nih.gov/nuccore/372199319?report=fasta
Write a very brief description (in the file homework-4a.txt) of what you did to identify the sequences given above. Do not copy and paste the output of BLAST.
Note
Learning Goals:
What will be covered in class:
redirection
- ls -l and redirect output to a file
- append date to this file
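A minimal sketch of the two redirection steps listed above. The file name listing.txt is only an example, not a name used in class:

```shell
# > sends the output of ls -l to listing.txt, creating or overwriting the file
ls -l > listing.txt

# >> appends the output of date to the end of the same file
date >> listing.txt

# the last line of the file is now the date
tail -n 1 listing.txt
```

Running the first command a second time would wipe out the appended date, because > always starts the file over.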
sort
- sort files in /dev with and without -r
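One possible way to try the /dev exercise, assuming a Linux system where /dev is readable. The output file names (dev_list.txt and so on) are made up for this sketch:

```shell
# save the names in /dev once, so both sorts see the same list
ls /dev > dev_list.txt

# ascending order, then the same list with -r for descending order
sort dev_list.txt > dev_sorted.txt
sort -r dev_list.txt > dev_reversed.txt

# compare the first few lines of each result
head -n 3 dev_sorted.txt
head -n 3 dev_reversed.txt
```

The first line of dev_sorted.txt should match the last line of dev_reversed.txt.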
blastn
download blast executable: ftp.ncbi.nih.gov//blast/executables/blast+/2.2.25/ncbi-blast-2.2.25+-ia32-linux.tar.gz
download 16S database: ftp.ncbi.nih.gov/blast/db/16SMicrobial.tar.gz
uncompress database, executables
use blastdbcmd -db 16SMicrobial -entry all to get all fasta sequences:
../ncbi-blast-2.2.25+/bin/blastdbcmd -db 16SMicrobial -entry all > 16SMicrobial.fa
use head -30 or so to get the first fasta sequence to use as a test, and redirect it to a file named 1.fa:
head -30 16SMicrobial.fa > 1.fa
edit 1.fa so it contains only one complete sequence
execute blastn -h to get help
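Instead of trimming 1.fa by hand in an editor, the first complete record can also be cut out with awk. This is only a sketch, not the method used in class; demo.fa and 1_demo.fa are made-up stand-ins so the example runs without the real database:

```shell
# build a tiny stand-in for 16SMicrobial.fa (made-up records, for illustration only)
printf '>seq1 first record\nACGT\nTTGA\n>seq2 second record\nGGCC\n' > demo.fa

# count header lines; stop as soon as the second header is reached,
# otherwise print the current line
awk '/^>/ { n++ } n > 1 { exit } { print }' demo.fa > 1_demo.fa

cat 1_demo.fa
```

On the real data the same awk command would read 16SMicrobial.fa and write 1.fa, and it does not depend on guessing how many lines the first sequence occupies.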
execute blastn:
../ncbi-blast-2.2.25+/bin/blastn -db 16SMicrobial -query 1.fa
grep patterns
grep can search for patterns:
grep '^>' sequence.fa means: find the pattern '^>' in the file sequence.fa. The special character ^ matches only at the beginning of a line, so grep looks for the character > only at the start of a line. This is exactly what FASTA header lines start with.
sort
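Before extracting the definition lines, the effect of the ^ anchor can be checked on a small file. This is a sketch with made-up content; tiny.txt is a hypothetical file, not part of the 16S data:

```shell
# two header lines, plus one line that contains > in the middle
printf '>seq1 first record\nACGT\nsee >seq1 above\n>seq2 second record\nGGCC\n' > tiny.txt

# with ^ : only lines that START with > are printed (the two headers)
grep '^>' tiny.txt

# -c prints the number of matching lines instead of the lines themselves
grep -c '^>' tiny.txt   # anchored: counts the 2 header lines
grep -c '>' tiny.txt    # not anchored: counts 3 lines
```

On a FASTA file, grep -c '^>' is a quick way to count how many sequences the file holds.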
Extract fasta definition lines from 16S database:
grep '^>' 16SMicrobial.fa > 16S_defLines.txt
Sort the definition lines:
sort 16S_defLines.txt
Sort the definition lines and pipe the output to less:
sort 16S_defLines.txt | less
Sort in reverse order:
sort -r 16S_defLines.txt | less
Sort by second field - the beginning of the description:
sort -k2 16S_defLines.txt | less
Sort by second field in reverse:
sort -k2r 16S_defLines.txt | less
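The difference between the whole-line sort and the -k2 sort is easier to see on a few lines than on the full database. The definition lines below are made up for this sketch (real ones carry longer IDs), and demo_defLines.txt is a hypothetical file name:

```shell
# three made-up definition lines: an ID, a space, then the description
printf '>id3 Bacillus subtilis\n>id1 Escherichia coli\n>id2 Anabaena variabilis\n' > demo_defLines.txt

sort demo_defLines.txt        # whole-line order: id1, id2, id3
sort -k2 demo_defLines.txt    # by description: Anabaena, Bacillus, Escherichia
sort -k2r demo_defLines.txt   # by description, reversed: Escherichia, Bacillus, Anabaena
```

With -k2 the sort key starts at the second whitespace-separated field and runs to the end of the line, so lines that share the first word of the description are ordered by the words that follow.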
Reading
Exercises
- complete Exercises 1-4 in the python reading above
- use the AWS ec2 server to do this work
Turn In
- email me the user ID that XSEDE assigned to you. For example, mine is: jvincent
- leave the python files ex1.py through ex4.py in your homework directory on the AWS server
- write a shell script to run blastn against the 16S microbial database for the unknown sequence below
- copy the sequence into a text file first ( homework5.fa )
- in the script, use redirection to send all blast output to a text file
- use variables for the database name, query file and blast program
- use full paths in the variable values
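A rough outline of how such a script could be laid out. Every path below is hypothetical (the /home/student locations and the output file name are placeholders), so substitute the full paths from your own account. The existence check is an extra safeguard, not a requirement of the assignment:

```shell
#!/bin/bash
# Hypothetical full paths: replace each one with the real location on the server.
BLAST=/home/student/ncbi-blast-2.2.25+/bin/blastn
DB=/home/student/blastdb/16SMicrobial
QUERY=/home/student/homework/homework5.fa
OUT=/home/student/homework/homework5_blast.txt

# run blastn only if the program and the query file are where we expect them
if [ -x "$BLAST" ] && [ -f "$QUERY" ]; then
    # > sends normal output to $OUT; 2>&1 sends error messages to the same file
    "$BLAST" -db "$DB" -query "$QUERY" > "$OUT" 2>&1
else
    echo "check the paths: $BLAST or $QUERY was not found" >&2
fi
```

Keeping the paths in variables at the top means the blastn line never changes when the database or query file moves.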
Unknown sequence:
>Homework5
AGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAACGGAGACAATTGGTTCGCTGA
TTGTCTTAGTGGCGGACGGGTGAGTAACGCGTGAGCAATCTGCCCTTCGGAGGGGGACAACAGCTGGAAACGGCTGCTAA
TACCGCATAATGTATATTCAAGGCATCTTGGATATACCAAAGATTTATCGCCGAAGGATGAGCTCGCGTCTGATTAGCTA
GTTGGTGAGGTAAAGGCTCACCAAGGCTGCGATCAGTAGCCGGACTGAGAGGTTGAACGGCCACATTGGAACTGAGATAC
GGGCCAGACTCCTACGGGAGGGAGCAGTGGGGAATTTTGGNCAATGGGGGAAAGCCNTACCCAGCAACGCCGCGTGAAGG
AAGAAGGCCTTCGGGTTGTAAACTTCTTTGACCAGGGACGAAACAAATGACGGTACCTGGAAAACAAGCCACGGCTAACT
ACGTGCCAGCAGCCGCGGTATTACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGCGCGTAGGCGGGAGT
ACAAGTCAGATGTGAAATCTGGGGGCTTAACCCTCAAACTGCATTTGAAACTGTATTTCTTGAGTATCGGAGAGGCAGGC
GGAATTCCTAGTGTAGCGGTGAAATGCGTTGATATTAGGAGGAACACCAGTGGCGAAGGCGGCCTGCTGGACGACAACTG
ACTCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGAATACTAGGTG
TGGGGGGACTGACCCCCTCCGTGCCGGAGTTAACACAATAAGTATTCCACCTGGGGAGTACGNCCGCAAGGTTGAAACTC
AAAGGAATTGACGGGGGCCCGCACAAGCAGTGGATTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGGCTT
GACATCGTACTAACGAAGCAGAGATGCATTAGGTGCCCTTCCGGGGAAAGTATAGACAGGTGGTGCATGGTTGTCGTCAG
CTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTNATTTGCTACNCGAGANCACTCTAGCG
AGGCTGCCGATGACAAACCGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGTCCTGGGCTACACACGTAA
TACAATGTCTCTCACAGAGGGAAGCAAGACCGCGAGGTGGAGCAAATCCCTAAAATGCGTCTCAGTTCAGATTGCAGGCT
GCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTAC
ACACCGCCCGTCACACCATGAGAGCCGGGAACACCCGAAGTCCGTAGTCTAACCGCAAGGGGGACGCGGCCGAAGGTGGG
TTTGGTAATTGGGGTGAAGTCGTAACAAGGTAGCCGTATCGGAAGGTGCGGCTGGATCACCTCCTT