blast教程.docx

上传人:b****7 文档编号:9997209 上传时间:2023-02-07 格式:DOCX 页数:16 大小:21.10KB
下载 相关 举报
blast教程.docx_第1页
第1页 / 共16页
blast教程.docx_第2页
第2页 / 共16页
blast教程.docx_第3页
第3页 / 共16页
blast教程.docx_第4页
第4页 / 共16页
blast教程.docx_第5页
第5页 / 共16页
点击查看更多>>
下载资源
资源描述

blast教程.docx

《blast教程.docx》由会员分享,可在线阅读,更多相关《blast教程.docx(16页珍藏版)》请在冰豆网上搜索。

blast教程.docx

blast教程

生信文件格式,常用程序

Applications-HandsOn

BioinformaticsApplications Hands-on

Goal:

getfamiliarwithcommandlinebioinformaticstools,uselinuxshelltodrivesingleorpipedorscriptedcommands.Fourmini-modulesareincluded,addressingfoursetsofcommonbioinformaticstasks.

 

Dataanddirectorystructure

 

Makeacopyofallthedatafromcentralplacetoyourowndirectory:

$cd/export/

$cdyour-own-directory/

$cp-r/export/apps.

$cdapps/

$ls

data module1 module2 module3 module4 

 

Wewilldiveintoeachdirectory,oneforeachmodule, intheorderbelow.

 

Module-1:

FASTA,BLASTandMSA

InthismodulewewanttopracticemanipulatingFASTAfileandindoingso,constructphylogenetictreeforacertaingenefamilyCesA.

 

$cdmodule1/

$ls

 

Kenttools isanexcellent setoftoolsusedbyUCSCgenomebrowserteam(http:

//hgdownload.cse.ucsc.edu/admin/jksrc.zip;butwe’veinstalledforyou).Nodocumentations,buteasytouse.WewillpracticemanipulatingFASTAfile.

 

Gotomedicagowebsite(http:

//medicago.jcvi.org/cgi-bin/medicago/download.cgi).Findalinkcalled“Mt3.5v5_GenesCDSSeq_20111014.fa”.RightclickandcopytheURL,backtoyourterminal,type“wget”,andthenCmd+V(Mac)orShift+Insert(Windows/Linux)topastethatlinkin!

$wget ftp:

//ftp.jcvi.org/pub/data/m_truncatula/Mt3.5/Annotation/Mt3.5v5/Mt3.5v...

 

ToseethegeneralstatisticsofaFASTAfile:

$faSizeMt3.5v5_GenesCDSSeq_20111014.fa

Thisfilecontains:

#sequences______,minsize______,maxsize_____

Toseethesizeofeachsequence:

$faSizeMt3.5v5_GenesCDSSeq_20111014.fa-detailed-tab

 

HowaboutfindingthelongestCDS?

Pipetheabovecommandtoasort(youwanttosortthesecondcolumn,numerically,andinreversesothelongestappearsfirst)

$faSizeMt3.5v5_GenesCDSSeq_20111014.fa-detailed-tab|sort-k2,2nr

 

Whichsequenceisthelongest?

(Hint:

 you’llneedtoseethefirstfewlinesoftheoutputfromtheabovecommand)

LongestsequenceID:

________________,it’ssize:

____________bp

 

Now,youwanttoextractthelongestsequence.

$faOneRecordMt3.5v5_GenesCDSSeq_20111014.fa"IMGA|contig_49801_3.1"

 

Youneedtoquotetheidentifierbecausethereisapipe(|) inthename.

 

Savetheoutputtoafile(use> toredirectoutputtoanewfile).RunfaSize-detailed onthenewfile,isthelengththesameasyoursawbefore?

 

Findgenefamilymembers

Iwanttostudyallthecellulosesynthase(CesA)genesinMedicago.HowshouldIdoit?

1.ExtractalistofCesAgenesfromarelatedspecies

2.QuerythislistagainsttheMedicagogenesettofishouttheMedicagoCesAgenes

3.Runmultiplesequencealignmentsandbuildaphylogenetictree

 

Manyofyouarealreadyfamiliarwiththisprocess,butlet’sdoitthroughthecommandline!

 

ExtractalistofCesAgenesfromA.thaliana

GotoTAIRwebsite(http:

//www.arabidopsis.org).Inthesearchboxontopright,typein“CesA”.Onresultspage,findabutton“getallsequences”.Youwanttodownloadthemasproteinsequences.

 

Thereturnedpagehasallthesequences.Nowcopythemandthencreateafileontheserver,pastethecontentsin,andcallit“At-CesA.fasta”.Rememberyoucandoitthroughatexteditor:

gedit ornano.Note:

youdonotneedthetwocommentlinesinthebeginning.YouFASTAfileshouldstartwith‘>’.

 

Makesureyouhavealltheproteinsyouwantedtodownload:

$faSizeAt-CesA.fasta

 

HowmanysequencesinArabidopsis?

______

 

RunalocalBLAST

Inthiscontext,‘local’meansyouarerunningBLASTonyourownserver,notatNCBIoranyoneelse’sserver.Thisgivesyoutheflexibilityofcomparingyourqueryeitheragainstprecomputeddatabases(likeNR,Swissprot,trEMBL,etc.)oragainstacustomizeddatabase,containingaspecificsetofsequences.

 

QuerySequenceSet:

At-CesA.fasta

ReferenceDatabase:

Mt3.5v5_GenesCDSSeq_20111014.fa

 

Rememberthequeryisasetofprotein sequences,databaseisasetofnucleotide sequences.WhichBLASTprogramshouldyouusehere?

______________________

(Hint:

 chooseamong:

BLASTP,BLASTN,BLASTX,TBLASTN,TBLASTX)

BLASTReference:

 http:

//www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml

 

Generateacustomizedsearchdatabase(MedicagoCDSs)

$makeblastdb-dbtypenucl-inMt3.5v5_GenesCDSSeq_20111014.fa

 

BLASTtheAtCesA’sagainstMedicagoCDSs

$tblastn-queryAt-CesA.fasta-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-outAt-CesA.MtCDS.tblastn

 

Waituntilitfinishesthentakealookattheoutput:

$lessAt-CesA.MtCDS.tblastn

 

BLASTtabularformat(asspecifiedin-outfmt6)hasmultiplecolumns,andistheeasiestoutputformattoworkwith. 

 

Whichcolumnhasthequery?

Whichcolumnhasthesubject(i.e.thesequencesinthedatabasethatmatchyourquery,whichyouwanttoextractoutlater)?

Thesubjectcolumnis______

 

Cutoutthecolumn(-f2):

$cut-f2At-CesA.MtCDS.tblastn|sort-u>hits.ids

 

Howmanyuniquehitsdidweget?

Thisequalstothenumberoflinesinthefile:

$wc-lhits.ids

Numberofuniquehits:

_________

 

Itappearswearegettinglotsofhits.WemayneedtogobacktotheBLASTcommandtoaddanE-valuecutofftobemorestringent.HowtoaddE-valuecutoff?

Let’slookatthehelpfortblastn

$tblastn-help

 

NowrunBLASTwithmorestringentsettings(E-valuecutoff:

1e-20):

$tblastn-queryAt-CesA.fasta-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-outAt-CesA.MtCDS.tblastn-evalue1e-20

$cut-f2At-CesA.MtCDS.tblastn|sort-u>hits.ids

NumberofuniquehitswithE-value1e-20:

_________

 

Extractsequencesfrommedicago

ThecommandfaSomeRecords canbeusedtoextractmultiplesequences,nowlet’sgetalltheBLASThitstoretrieveagenefamily.

$faSomeRecordsMt3.5v5_GenesCDSSeq_20111014.fahits.idshits.fasta

 

Verifythatyouaregettingallthemedicagohitsthatyouwant:

-countthelinesinhits.ids(hint:

 $wc-lhits.ids)_________

-countthenumberofrecordsinhits.fasta(hint:

 $faSizehits.fasta)________

 

Thetwonumbersshouldagree.Whenworkinginbioinformatics,alwayscheck,check,andchecktomakesurenumbersaddup.

 

RunMUSCLEtogetamultiplesequencealignment

$muscle-maxiters1-inhits.fasta-clw-outhits.aln

Notewejustwantafastalignmenthere(-maxiter1).Takealookattheoutputfilehits.aln.

 

HowtomakemyslowBLASTjobrunfaster?

!

Welcometotheworldofparallelcomputing!

TheBLASTjobiswhatwecall“embarrassinglyparallel”,meaningthatit’srelativelyeasytodo-byhavingeachCPUworkonapartoftheinputfile!

 

Todothis,let’ssplitthequeryFASTAfirstintofourfiles.LearnhowtoconstructthecommandbyreadingthefaSplit manual,simplytypefaSplit withnoarguments.

 

$faSplitsequenceAt-CesA.fasta4split

 

Wewillthenuseashelltoolcalled GNUParallel,thatallowsuserstoeasilylaunchmultiplejobsbyusingatemplatecontainingplaceholders,whichcaneasilybereplacedwithcontentfromaninputsource.

 

Firstlet’sidentifytheinputsource.Inourcase,itisgoingtobethe4FASTAfilesyoujustcreatedusingfaSplit

$lssplit*.fa*

Whatdoyousee?

 

Wewillnowtakethetblastn commandfromabove,insertsomeplaceholdersandpasstheinputsourcetoit(usingthepipetoredirecttheinput).

 

$lssplit*.fa*|parallel'tblastn-query{}-dbMt3.5v5_GenesCDSSeq_20111014.fa-outfmt6-out{.}.MtCDS.tblastn-evalue1e-20'

 

Meaningoftheplaceholdersused:

{}:

Replacedbyeachlinefromtheinputsource

{.}:

 Replacedbyeachlinefromtheinputsource,withtheextensionremoved

 

Withthismechanism,eachsplittedFASTAfilewillgenerateatblastncommand,makingatotalof4commandstorunsimultaneously,thereforegetjobsdonefaster.Don’tforgettoconcatenatetheoutputstogether.

 

$catsplit*.MtCDS.tblastn>all.MtCDS.tblastn

 

Refertothe manpage forGNUparallel forfurtherdetailedinformation,ifyouhavetime.

 

 

Module-2:

FASTQ,FASTQCandTRIMMOMATIC

$cd../module2/

$ls

PE-350.1.fastq PE-350.2.fastq adapters.fasta

 

UnderstandFASTQformat

AlsothefilePE-350.1.fastq andPE-350.2.fastq arepairedfiles,withrecordfromeachfilesequentiallymatcheachother,i.e.pairofreads,sothisiscalleda“paired-end”library.FASTQfilesaretextfiles:

$lessPE-350.1.fastq

 

Whatinformationisincluded?

Whatdoeseachlinemean?

FourconsecutivelinesdefineONEreadinthefastqfile

 

EachbasehasaqualityvaluecalledPHREDscore,typicallybetween0-40forIllumina.Phredscore:

        10-10%error

        20-1%error

        30-0.1%error

        40-0.01%error

 

Well,showingnumbersinfilesisnotreallycompact,soFASTQfileconvertsthenumericvaluetoacharacter.Therearealsotwosystemsofsuchconversion,PHRED+33andPHRED+64.PHRED+33isusedinallnewIlluminaprotocols.

CanyouguessifyourFASTQfileisinPHRED+33orPHRED+64encoding?

Quicktip:

 Lookforstretchof#######orBBBBBBBthattrailsanyread,whichmeansit’sreallyBADquality.Ifyousee#######,thenthat’sPhred+33,andBBBBBBBBthat’sPhred+64(Why?

 

GobacktoourFASTQfiles, findoutanswerstothesequestions:

LibraryPE-350containsatotalof______reads.(hint:

count#oflineswith`wc-lPE-350.1.fastq`,thendivideitby4).

Thebasequalityinmydataisencodedin________(Phred+33orPhred+64?

).

 

Howgoodisyoursequencingrun?

UseFASTQC

WearetoolazytoinstallFASTQCforyou!

J

Seeifyoucangetittowork-GoogleFASTQCandfindthe“Download”link.Findthelinkthatsayssomethinglike“FastQC(Linuxzipfile)”.Nowdownloadit.

 

$wget http:

//www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.10.1.zip

$unzipfastqc_v0.10.1.zip

$cdFastQC

$

展开阅读全文
相关资源
猜你喜欢
相关搜索

当前位置:首页 > 幼儿教育 > 育儿知识

copyright@ 2008-2022 冰豆网网站版权所有

经营许可证编号:鄂ICP备2022015515号-1