blast教程.docx - 冰豆网

资源描述

blast教程.docx

《blast教程.docx》由会员分享，可在线阅读，更多相关《blast教程.docx（16页珍藏版）》请在冰豆网上搜索。

blast教程.docx

blast教程

生信文件格式，常用程序

Applications-HandsOn

BioinformaticsApplications Hands-on

Goal:

getfamiliarwithcommandlinebioinformaticstools,uselinuxshelltodrivesingleorpipedorscriptedcommands.Fourmini-modulesareincluded,addressingfoursetsofcommonbioinformaticstasks.

Dataanddirectorystructure

Makeacopyofallthedatafromcentralplacetoyourowndirectory:

$cd/export/

$cdyour-own-directory/

$cp-r/export/apps.

$cdapps/

$ls

data module1 module2 module3 module4

Wewilldiveintoeachdirectory,oneforeachmodule, intheorderbelow.

Module-1:

FASTA,BLASTandMSA

InthismodulewewanttopracticemanipulatingFASTAfileandindoingso,constructphylogenetictreeforacertaingenefamilyCesA.

$cdmodule1/

$ls

Kenttools isanexcellent setoftoolsusedbyUCSCgenomebrowserteam（http:

//hgdownload.cse.ucsc.edu/admin/jksrc.zip;butwe’veinstalledforyou）.Nodocumentations,buteasytouse.WewillpracticemanipulatingFASTAfile.

Gotomedicagowebsite（http:

//medicago.jcvi.org/cgi-bin/medicago/download.cgi）.Findalinkcalled“Mt3.5v5_GenesCDSSeq_20111014.fa”.RightclickandcopytheURL,backtoyourterminal,type“wget”,andthenCmd+V（Mac）orShift+Insert（Windows/Linux）topastethatlinkin!

$wget ftp:

//ftp.jcvi.org/pub/data/m_truncatula/Mt3.5/Annotation/Mt3.5v5/Mt3.5v...

ToseethegeneralstatisticsofaFASTAfile:

$faSizeMt3.5v5_GenesCDSSeq_20111014.fa

Thisfilecontains:

#sequences______,minsize______,maxsize_____

Toseethesizeofeachsequence:

$faSizeMt3.5v5_GenesCDSSeq_20111014.fa-detailed-tab

HowaboutfindingthelongestCDS?

Pipetheabovecommandtoasort（youwanttosortthesecondcolumn,numerically,andinreversesothelongestappearsfirst）

$faSizeMt3.5v5_GenesCDSSeq_20111014.fa-detailed-tab|sort-k2,2nr

Whichsequenceisthelongest?

（Hint:

you’llneedtoseethefirstfewlinesoftheoutputfromtheabovecommand）

LongestsequenceID:

________________,it’ssize:

____________bp

Now,youwanttoextractthelongestsequence.

$faOneRecordMt3.5v5_GenesCDSSeq_20111014.fa"IMGA|contig_49801_3.1"

Youneedtoquotetheidentifierbecausethereisapipe（|） inthename.

Savetheoutputtoafile（use> toredirectoutputtoanewfile）.RunfaSize-detailed onthenewfile,isthelengththesameasyoursawbefore?

Findgenefamilymembers

Iwanttostudyallthecellulosesynthase（CesA）genesinMedicago.HowshouldIdoit?

1.ExtractalistofCesAgenesfromarelatedspecies

2.QuerythislistagainsttheMedicagogenesettofishouttheMedicagoCesAgenes

3.Runmultiplesequencealignmentsandbuildaphylogenetictree

Manyofyouarealreadyfamiliarwiththisprocess,butlet’sdoitthroughthecommandline!

ExtractalistofCesAgenesfromA.thaliana

GotoTAIRwebsite（http:

//www.arabidopsis.org）.Inthesearchboxontopright,typein“CesA”.Onresultspage,findabutton“getallsequences”.Youwanttodownloadthemasproteinsequences.

Thereturnedpagehasallthesequences.Nowcopythemandthencreateafileontheserver,pastethecontentsin,andcallit“At-CesA.fasta”.Rememberyoucandoitthroughatexteditor:

gedit ornano.Note:

youdonotneedthetwocommentlinesinthebeginning.YouFASTAfileshouldstartwith‘>’.

Makesureyouhavealltheproteinsyouwantedtodownload:

$faSizeAt-CesA.fasta

HowmanysequencesinArabidopsis?

______

RunalocalBLAST

Inthiscontext,‘local’meansyouarerunningBLASTonyourownserver,notatNCBIoranyoneelse’sserver.Thisgivesyoutheflexibilityofcomparingyourqueryeitheragainstprecomputeddatabases（likeNR,Swissprot,trEMBL,etc.）oragainstacustomizeddatabase,containingaspecificsetofsequences.

QuerySequenceSet:

At-CesA.fasta

ReferenceDatabase:

Mt3.5v5_GenesCDSSeq_20111014.fa

Rememberthequeryisasetofprotein sequences,databaseisasetofnucleotide sequences.WhichBLASTprogramshouldyouusehere?

______________________

（Hint:

chooseamong:

BLASTP,BLASTN,BLASTX,TBLASTN,TBLASTX）

BLASTReference:

http:

//www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml

Generateacustomizedsearchdatabase（MedicagoCDSs）

$makeblastdb-dbtypenucl-inMt3.5v5_GenesCDSSeq_20111014.fa

BLASTtheAtCesA’sagainstMedicagoCDSs

$tblastn-queryAt-CesA.fasta-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-outAt-CesA.MtCDS.tblastn

Waituntilitfinishesthentakealookattheoutput:

$lessAt-CesA.MtCDS.tblastn

BLASTtabularformat（asspecifiedin-outfmt6）hasmultiplecolumns,andistheeasiestoutputformattoworkwith.

Whichcolumnhasthequery?

Whichcolumnhasthesubject（i.e.thesequencesinthedatabasethatmatchyourquery,whichyouwanttoextractoutlater）?

Thesubjectcolumnis______

Cutoutthecolumn（-f2）:

$cut-f2At-CesA.MtCDS.tblastn|sort-u>hits.ids

Howmanyuniquehitsdidweget?

Thisequalstothenumberoflinesinthefile:

$wc-lhits.ids

Numberofuniquehits:

_________

Itappearswearegettinglotsofhits.WemayneedtogobacktotheBLASTcommandtoaddanE-valuecutofftobemorestringent.HowtoaddE-valuecutoff?

Let’slookatthehelpfortblastn

$tblastn-help

NowrunBLASTwithmorestringentsettings（E-valuecutoff:

1e-20）:

$tblastn-queryAt-CesA.fasta-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-outAt-CesA.MtCDS.tblastn-evalue1e-20

$cut-f2At-CesA.MtCDS.tblastn|sort-u>hits.ids

NumberofuniquehitswithE-value1e-20:

_________

Extractsequencesfrommedicago

ThecommandfaSomeRecords canbeusedtoextractmultiplesequences,nowlet’sgetalltheBLASThitstoretrieveagenefamily.

$faSomeRecordsMt3.5v5_GenesCDSSeq_20111014.fahits.idshits.fasta

Verifythatyouaregettingallthemedicagohitsthatyouwant:

-countthelinesinhits.ids（hint:

$wc-lhits.ids）_________

-countthenumberofrecordsinhits.fasta（hint:

$faSizehits.fasta）________

Thetwonumbersshouldagree.Whenworkinginbioinformatics,alwayscheck,check,andchecktomakesurenumbersaddup.

RunMUSCLEtogetamultiplesequencealignment

$muscle-maxiters1-inhits.fasta-clw-outhits.aln

Notewejustwantafastalignmenthere（-maxiter1）.Takealookattheoutputfilehits.aln.

HowtomakemyslowBLASTjobrunfaster?

Welcometotheworldofparallelcomputing!

TheBLASTjobiswhatwecall“embarrassinglyparallel”,meaningthatit’srelativelyeasytodo-byhavingeachCPUworkonapartoftheinputfile!

Todothis,let’ssplitthequeryFASTAfirstintofourfiles.LearnhowtoconstructthecommandbyreadingthefaSplit manual,simplytypefaSplit withnoarguments.

$faSplitsequenceAt-CesA.fasta4split

Wewillthenuseashelltoolcalled GNUParallel,thatallowsuserstoeasilylaunchmultiplejobsbyusingatemplatecontainingplaceholders,whichcaneasilybereplacedwithcontentfromaninputsource.

Firstlet’sidentifytheinputsource.Inourcase,itisgoingtobethe4FASTAfilesyoujustcreatedusingfaSplit

$lssplit*.fa*

Whatdoyousee?

Wewillnowtakethetblastn commandfromabove,insertsomeplaceholdersandpasstheinputsourcetoit（usingthepipetoredirecttheinput）.

$lssplit*.fa*|parallel'tblastn-query{}-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-out{.}.MtCDS.tblastn-evalue1e-20'

Meaningoftheplaceholdersused:

{}:

Replacedbyeachlinefromtheinputsource

{.}:

Replacedbyeachlinefromtheinputsource,withtheextensionremoved

Withthismechanism,eachsplittedFASTAfilewillgenerateatblastncommand,makingatotalof4commandstorunsimultaneously,thereforegetjobsdonefaster.Don’tforgettoconcatenatetheoutputstogether.

$catsplit*.MtCDS.tblastn>all.MtCDS.tblastn

Refertothe manpage forGNUparallel forfurtherdetailedinformation,ifyouhavetime.

Module-2:

FASTQ,FASTQCandTRIMMOMATIC

$cd../module2/

$ls

PE-350.1.fastq PE-350.2.fastq adapters.fasta

UnderstandFASTQformat

AlsothefilePE-350.1.fastq andPE-350.2.fastq arepairedfiles,withrecordfromeachfilesequentiallymatcheachother,i.e.pairofreads,sothisiscalleda“paired-end”library.FASTQfilesaretextfiles:

$lessPE-350.1.fastq

Whatinformationisincluded?

Whatdoeseachlinemean?

FourconsecutivelinesdefineONEreadinthefastqfile

EachbasehasaqualityvaluecalledPHREDscore,typicallybetween0-40forIllumina.Phredscore:

10-10%error

20-1%error

30-0.1%error

40-0.01%error

Well,showingnumbersinfilesisnotreallycompact,soFASTQfileconvertsthenumericvaluetoacharacter.Therearealsotwosystemsofsuchconversion,PHRED+33andPHRED+64.PHRED+33isusedinallnewIlluminaprotocols.

CanyouguessifyourFASTQfileisinPHRED+33orPHRED+64encoding?

Quicktip:

Lookforstretchof#######orBBBBBBBthattrailsanyread,whichmeansit’sreallyBADquality.Ifyousee#######,thenthat’sPhred+33,andBBBBBBBBthat’sPhred+64（Why?

）

GobacktoourFASTQfiles, findoutanswerstothesequestions:

LibraryPE-350containsatotalof______reads.（hint:

count#oflineswith`wc-lPE-350.1.fastq`,thendivideitby4）.

Thebasequalityinmydataisencodedin________（Phred+33orPhred+64?

）.

Howgoodisyoursequencingrun?

UseFASTQC

WearetoolazytoinstallFASTQCforyou!

Seeifyoucangetittowork-GoogleFASTQCandfindthe“Download”link.Findthelinkthatsayssomethinglike“FastQC（Linuxzipfile）”.Nowdownloadit.

$wget http:

//www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.10.1.zip

$unzipfastqc_v0.10.1.zip

$cdFastQC

展开阅读全文