blast教程.docx
《blast教程.docx》由会员分享,可在线阅读,更多相关《blast教程.docx(16页珍藏版)》请在冰豆网上搜索。
blast教程
生信文件格式,常用程序
Applications-HandsOn
BioinformaticsApplications Hands-on
Goal:
getfamiliarwithcommandlinebioinformaticstools,uselinuxshelltodrivesingleorpipedorscriptedcommands.Fourmini-modulesareincluded,addressingfoursetsofcommonbioinformaticstasks.
Dataanddirectorystructure
Makeacopyofallthedatafromcentralplacetoyourowndirectory:
$cd/export/
$cdyour-own-directory/
$cp-r/export/apps.
$cdapps/
$ls
data module1 module2 module3 module4
Wewilldiveintoeachdirectory,oneforeachmodule, intheorderbelow.
Module-1:
FASTA,BLASTandMSA
InthismodulewewanttopracticemanipulatingFASTAfileandindoingso,constructphylogenetictreeforacertaingenefamilyCesA.
$cdmodule1/
$ls
Kenttools isanexcellent setoftoolsusedbyUCSCgenomebrowserteam(http:
//hgdownload.cse.ucsc.edu/admin/jksrc.zip;butwe’veinstalledforyou).Nodocumentations,buteasytouse.WewillpracticemanipulatingFASTAfile.
Gotomedicagowebsite(http:
//medicago.jcvi.org/cgi-bin/medicago/download.cgi).Findalinkcalled“Mt3.5v5_GenesCDSSeq_20111014.fa”.RightclickandcopytheURL,backtoyourterminal,type“wget”,andthenCmd+V(Mac)orShift+Insert(Windows/Linux)topastethatlinkin!
$wget ftp:
//ftp.jcvi.org/pub/data/m_truncatula/Mt3.5/Annotation/Mt3.5v5/Mt3.5v...
ToseethegeneralstatisticsofaFASTAfile:
$faSizeMt3.5v5_GenesCDSSeq_20111014.fa
Thisfilecontains:
#sequences______,minsize______,maxsize_____
Toseethesizeofeachsequence:
$faSizeMt3.5v5_GenesCDSSeq_20111014.fa-detailed-tab
HowaboutfindingthelongestCDS?
Pipetheabovecommandtoasort(youwanttosortthesecondcolumn,numerically,andinreversesothelongestappearsfirst)
$faSizeMt3.5v5_GenesCDSSeq_20111014.fa-detailed-tab|sort-k2,2nr
Whichsequenceisthelongest?
(Hint:
you’llneedtoseethefirstfewlinesoftheoutputfromtheabovecommand)
LongestsequenceID:
________________,it’ssize:
____________bp
Now,youwanttoextractthelongestsequence.
$faOneRecordMt3.5v5_GenesCDSSeq_20111014.fa"IMGA|contig_49801_3.1"
Youneedtoquotetheidentifierbecausethereisapipe(|) inthename.
Savetheoutputtoafile(use> toredirectoutputtoanewfile).RunfaSize-detailed onthenewfile,isthelengththesameasyoursawbefore?
Findgenefamilymembers
Iwanttostudyallthecellulosesynthase(CesA)genesinMedicago.HowshouldIdoit?
1.ExtractalistofCesAgenesfromarelatedspecies
2.QuerythislistagainsttheMedicagogenesettofishouttheMedicagoCesAgenes
3.Runmultiplesequencealignmentsandbuildaphylogenetictree
Manyofyouarealreadyfamiliarwiththisprocess,butlet’sdoitthroughthecommandline!
ExtractalistofCesAgenesfromA.thaliana
GotoTAIRwebsite(http:
//www.arabidopsis.org).Inthesearchboxontopright,typein“CesA”.Onresultspage,findabutton“getallsequences”.Youwanttodownloadthemasproteinsequences.
Thereturnedpagehasallthesequences.Nowcopythemandthencreateafileontheserver,pastethecontentsin,andcallit“At-CesA.fasta”.Rememberyoucandoitthroughatexteditor:
gedit ornano.Note:
youdonotneedthetwocommentlinesinthebeginning.YouFASTAfileshouldstartwith‘>’.
Makesureyouhavealltheproteinsyouwantedtodownload:
$faSizeAt-CesA.fasta
HowmanysequencesinArabidopsis?
______
RunalocalBLAST
Inthiscontext,‘local’meansyouarerunningBLASTonyourownserver,notatNCBIoranyoneelse’sserver.Thisgivesyoutheflexibilityofcomparingyourqueryeitheragainstprecomputeddatabases(likeNR,Swissprot,trEMBL,etc.)oragainstacustomizeddatabase,containingaspecificsetofsequences.
QuerySequenceSet:
At-CesA.fasta
ReferenceDatabase:
Mt3.5v5_GenesCDSSeq_20111014.fa
Rememberthequeryisasetofprotein sequences,databaseisasetofnucleotide sequences.WhichBLASTprogramshouldyouusehere?
______________________
(Hint:
chooseamong:
BLASTP,BLASTN,BLASTX,TBLASTN,TBLASTX)
BLASTReference:
http:
//www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml
Generateacustomizedsearchdatabase(MedicagoCDSs)
$makeblastdb-dbtypenucl-inMt3.5v5_GenesCDSSeq_20111014.fa
BLASTtheAtCesA’sagainstMedicagoCDSs
$tblastn-queryAt-CesA.fasta-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-outAt-CesA.MtCDS.tblastn
Waituntilitfinishesthentakealookattheoutput:
$lessAt-CesA.MtCDS.tblastn
BLASTtabularformat(asspecifiedin-outfmt6)hasmultiplecolumns,andistheeasiestoutputformattoworkwith.
Whichcolumnhasthequery?
Whichcolumnhasthesubject(i.e.thesequencesinthedatabasethatmatchyourquery,whichyouwanttoextractoutlater)?
Thesubjectcolumnis______
Cutoutthecolumn(-f2):
$cut-f2At-CesA.MtCDS.tblastn|sort-u>hits.ids
Howmanyuniquehitsdidweget?
Thisequalstothenumberoflinesinthefile:
$wc-lhits.ids
Numberofuniquehits:
_________
Itappearswearegettinglotsofhits.WemayneedtogobacktotheBLASTcommandtoaddanE-valuecutofftobemorestringent.HowtoaddE-valuecutoff?
Let’slookatthehelpfortblastn
$tblastn-help
NowrunBLASTwithmorestringentsettings(E-valuecutoff:
1e-20):
$tblastn-queryAt-CesA.fasta-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-outAt-CesA.MtCDS.tblastn-evalue1e-20
$cut-f2At-CesA.MtCDS.tblastn|sort-u>hits.ids
NumberofuniquehitswithE-value1e-20:
_________
Extractsequencesfrommedicago
ThecommandfaSomeRecords canbeusedtoextractmultiplesequences,nowlet’sgetalltheBLASThitstoretrieveagenefamily.
$faSomeRecordsMt3.5v5_GenesCDSSeq_20111014.fahits.idshits.fasta
Verifythatyouaregettingallthemedicagohitsthatyouwant:
-countthelinesinhits.ids(hint:
$wc-lhits.ids)_________
-countthenumberofrecordsinhits.fasta(hint:
$faSizehits.fasta)________
Thetwonumbersshouldagree.Whenworkinginbioinformatics,alwayscheck,check,andchecktomakesurenumbersaddup.
RunMUSCLEtogetamultiplesequencealignment
$muscle-maxiters1-inhits.fasta-clw-outhits.aln
Notewejustwantafastalignmenthere(-maxiter1).Takealookattheoutputfilehits.aln.
HowtomakemyslowBLASTjobrunfaster?
!
Welcometotheworldofparallelcomputing!
TheBLASTjobiswhatwecall“embarrassinglyparallel”,meaningthatit’srelativelyeasytodo-byhavingeachCPUworkonapartoftheinputfile!
Todothis,let’ssplitthequeryFASTAfirstintofourfiles.LearnhowtoconstructthecommandbyreadingthefaSplit manual,simplytypefaSplit withnoarguments.
$faSplitsequenceAt-CesA.fasta4split
Wewillthenuseashelltoolcalled GNUParallel,thatallowsuserstoeasilylaunchmultiplejobsbyusingatemplatecontainingplaceholders,whichcaneasilybereplacedwithcontentfromaninputsource.
Firstlet’sidentifytheinputsource.Inourcase,itisgoingtobethe4FASTAfilesyoujustcreatedusingfaSplit
$lssplit*.fa*
Whatdoyousee?
Wewillnowtakethetblastn commandfromabove,insertsomeplaceholdersandpasstheinputsourcetoit(usingthepipetoredirecttheinput).
$lssplit*.fa*|parallel'tblastn-query{}-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-out{.}.MtCDS.tblastn-evalue1e-20'
Meaningoftheplaceholdersused:
{}:
Replacedbyeachlinefromtheinputsource
{.}:
Replacedbyeachlinefromtheinputsource,withtheextensionremoved
Withthismechanism,eachsplittedFASTAfilewillgenerateatblastncommand,makingatotalof4commandstorunsimultaneously,thereforegetjobsdonefaster.Don’tforgettoconcatenatetheoutputstogether.
$catsplit*.MtCDS.tblastn>all.MtCDS.tblastn
Refertothe manpage forGNUparallel forfurtherdetailedinformation,ifyouhavetime.
Module-2:
FASTQ,FASTQCandTRIMMOMATIC
$cd../module2/
$ls
PE-350.1.fastq PE-350.2.fastq adapters.fasta
UnderstandFASTQformat
AlsothefilePE-350.1.fastq andPE-350.2.fastq arepairedfiles,withrecordfromeachfilesequentiallymatcheachother,i.e.pairofreads,sothisiscalleda“paired-end”library.FASTQfilesaretextfiles:
$lessPE-350.1.fastq
Whatinformationisincluded?
Whatdoeseachlinemean?
FourconsecutivelinesdefineONEreadinthefastqfile
EachbasehasaqualityvaluecalledPHREDscore,typicallybetween0-40forIllumina.Phredscore:
10-10%error
20-1%error
30-0.1%error
40-0.01%error
Well,showingnumbersinfilesisnotreallycompact,soFASTQfileconvertsthenumericvaluetoacharacter.Therearealsotwosystemsofsuchconversion,PHRED+33andPHRED+64.PHRED+33isusedinallnewIlluminaprotocols.
CanyouguessifyourFASTQfileisinPHRED+33orPHRED+64encoding?
Quicktip:
Lookforstretchof#######orBBBBBBBthattrailsanyread,whichmeansit’sreallyBADquality.Ifyousee#######,thenthat’sPhred+33,andBBBBBBBBthat’sPhred+64(Why?
)
GobacktoourFASTQfiles, findoutanswerstothesequestions:
LibraryPE-350containsatotalof______reads.(hint:
count#oflineswith`wc-lPE-350.1.fastq`,thendivideitby4).
Thebasequalityinmydataisencodedin________(Phred+33orPhred+64?
).
Howgoodisyoursequencingrun?
UseFASTQC
WearetoolazytoinstallFASTQCforyou!
J
Seeifyoucangetittowork-GoogleFASTQCandfindthe“Download”link.Findthelinkthatsayssomethinglike“FastQC(Linuxzipfile)”.Nowdownloadit.
$wget http:
//www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.10.1.zip
$unzipfastqc_v0.10.1.zip
$cdFastQC
$