METHODS

 

GATK-Methods

 

----DNAseq

DNAseq OVERVIEW

Pre-processing

Local Realignment around Indels

Local Realignment around Indels

Base Quality Score Recalibration

Base Quality Score Recalibration (BQSR)

Variant Discovery Overview

Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode

How the HaplotypeCaller's reference confidence model works

Suggested Preliminary Analyses

Variant Filtering with VQSR

Variant Quality Score Recalibration (VQSR)

Genotype Refinement

Which tools use pedigree information?

Purpose and operation of Read-backed Phasing

Functional Annotation

Adding Genomic Annotations Using SnpEff and VariantAnnotator

Where can I get a gene list in RefSeq format?

Variant Evaluation

Selecting variants of interest from a callset

Combining variants from different files into one

Using VariantEval

DNAseq OVERVIEW

Pre-processing

Local Realignment around Indels

Local Realignment around Indels

RealignerTargetCreator / IndelRealigner

For a complete, detailed argument reference, refer to the GATK documentation pages for RealignerTargetCreator and IndelRealigner.

Running the IndelRealigner only at known sites

While we advocate running the IndelRealigner over an aggregated BAM using the full Smith-Waterman alignment algorithm, it will also work on a single lane of sequencing data when run in -knownsOnly mode. Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained. We modify the example arguments from above to reflect the command lines necessary for known-only/lane-level cleaning.

The RealignerTargetCreator step needs to be done just once for a given set of indels; as long as the set of known indels doesn't change, the output.intervals file produced by the command below never needs to be recalculated.

java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals \
  -known /path/to/indel_calls.vcf

The IndelRealigner step needs to be run on every BAM file.

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I <input.bam> \
  -R <reference.fasta> \
  -T IndelRealigner \
  -targetIntervals <output.intervals> \
  -o <realigned.bam> \
  -known /path/to/indel_calls.vcf \
  --consensusDeterminationModel KNOWNS_ONLY \
  -LOD 0.4
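Because the IndelRealigner step is repeated for every BAM file while the intervals file is reused, the per-file command lines can be generated in a loop. The following is a minimal Python sketch that only builds the argument lists from the flags shown above. The function name, the example BAM file names, and the `.realigned.bam` output naming are assumptions made for illustration and are not part of GATK.

```python
from pathlib import Path

def indel_realigner_cmd(bam, reference, intervals, known_vcf,
                        jar="/path/to/GenomeAnalysisTK.jar",
                        tmpdir="/path/to/tmpdir"):
    """Build the known-sites-only IndelRealigner command for one BAM file."""
    bam = Path(bam)
    # Hypothetical naming scheme: lane1.bam -> lane1.realigned.bam
    out = bam.with_name(bam.stem + ".realigned.bam")
    return [
        "java", "-Xmx4g", f"-Djava.io.tmpdir={tmpdir}",
        "-jar", jar,
        "-T", "IndelRealigner",
        "-I", str(bam),
        "-R", reference,
        "-targetIntervals", intervals,
        "-o", str(out),
        "-known", known_vcf,
        "--consensusDeterminationModel", "KNOWNS_ONLY",
        "-LOD", "0.4",
    ]

# The intervals file is computed once and shared by every lane-level BAM.
commands = [
    indel_realigner_cmd(bam,
                        reference="/path/to/reference.fasta",
                        intervals="/path/to/output.intervals",
                        known_vcf="/path/to/indel_calls.vcf")
    for bam in ["lane1.bam", "lane2.bam"]  # hypothetical input files
]
```

Each list can then be passed to `subprocess.run(cmd, check=True)` to execute the step for that file.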

Base Quality Score Recalibration

Base Quality Score Recalibration (BQSR)

Detailed information about command line options for BaseRecalibrator can be found on its GATK documentation page.

Introduction

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field of each read in the output BAM are more accurate, in that the reported quality score is closer to the actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular, these new probabilities fit very naturally as the gap penalties in HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.

This process is accomplished by analyzing the covariation among several features of a base. For example:

∙ Reported quality score

∙ The position within the read

∙ The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, before recalibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so they are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often of much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases, by separating the low-quality AC bases at the end of the read from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.
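The Q25 versus Q20 arithmetic above follows from the standard Phred definition, Q = -10 log10(p), where p is the error probability. The short Python sketch below works through that conversion; the function names are chosen here for illustration only.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled quality score implied by an error probability."""
    return -10 * math.log10(p)

reported_q = 25
# A reported Q25 base claims roughly a 1 in 316 chance of being wrong.
claimed_error = phred_to_error_prob(reported_q)

# If such bases actually mismatch the reference 1 time in 100,
# their empirical quality is only about Q20.
empirical_q = error_prob_to_phred(1 / 100)

print(f"claimed error rate for Q{reported_q}: {claimed_error:.5f}")
print(f"empirical quality at a 1/100 mismatch rate: Q{empirical_q:.1f}")
```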

The system was designed so that users can easily add new covariates to the calculations. Users wishing to add their own covariate can look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to define a getValue method, which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.

Running the tools

BaseRecalibrator

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

∙ read group the read belongs to

∙ assigned quality score

∙ machine cycle producing this base

∙ current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
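The counting described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the BaseRecalibrator implementation: the `ObservedBase` record, the `tabulate` function, and the toy observations are hypothetical, and treating each unique combination of the four features as a single bin is an assumption about how the bins are keyed.

```python
from collections import defaultdict
from typing import NamedTuple

class ObservedBase(NamedTuple):
    """One sequenced base with the features listed above (hypothetical record)."""
    read_group: str
    locus: int
    reported_qual: int
    cycle: int
    dinucleotide: str  # previous base + current base
    base: str
    ref_base: str

def tabulate(bases, known_variant_loci):
    """Return {bin: (n_bases, n_mismatches)}, skipping known variant sites."""
    counts = defaultdict(lambda: [0, 0])
    for b in bases:
        if b.locus in known_variant_loci:
            continue  # a mismatch here may be real variation, not an error
        key = (b.read_group, b.reported_qual, b.cycle, b.dinucleotide)
        counts[key][0] += 1
        if b.base != b.ref_base:
            counts[key][1] += 1
    return {key: tuple(value) for key, value in counts.items()}

observations = [
    ObservedBase("rg1", 100, 25, 2, "TG", "G", "G"),   # matches the reference
    ObservedBase("rg1", 101, 25, 2, "TG", "G", "A"),   # mismatch, counted as an error
    ObservedBase("rg1", 102, 25, 76, "AC", "C", "T"),  # known variant site, skipped
]
table = tabulate(observations, known_variant_loci={102})
```

In this toy input the single surviving bin holds two bases and one mismatch; the base at the known variant site contributes to neither count.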

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine's on-the-fly recalibration capability. Here is a typical command line to do so:

java -jar GenomeAnalysisTK.jar \
  -T PrintReads \
  -R reference.fasta \
  -I input.bam \
  -BQSR recalibration_report.grp \
  -o output.bam

After computing covariates on the initial BAM file, we walk through the BAM file again. This step uses the recalibration table in recalibration_report.grp, produced by BaseRecalibrator, to recalibrate the quality scores in input.bam, and writes out a new BAM file, output.bam, with recalibrated QUAL field values.

Effectively the new quality score is:

∙ the sum of the global difference between reported quality scores and the empirical quality

∙ plus the quality bin specific shift

∙ plus the cycle x qual and dinucleotide x qual effect
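One way to read the three terms above is as additive shifts applied to the reported score. The Python sketch below follows that reading; it is an assumption about how the terms combine, not the PrintReads implementation, and every shift value in it is invented purely for illustration.

```python
def recalibrated_quality(reported_q, cycle, dinucleotide,
                         global_shift, qual_shifts, cycle_shifts, dinuc_shifts):
    """Add the global, per-quality, cycle x qual and dinucleotide x qual shifts.

    Bins that were never observed contribute a shift of 0.
    """
    return (reported_q
            + global_shift
            + qual_shifts.get(reported_q, 0.0)
            + cycle_shifts.get((reported_q, cycle), 0.0)
            + dinuc_shifts.get((reported_q, dinucleotide), 0.0))

# Hypothetical shift tables, in Phred units.
global_shift = -3.0                  # reported scores are too high overall
qual_shifts = {25: -1.0}             # extra correction for the Q25 bin
cycle_shifts = {(25, 76): -2.0}      # late cycles are worse than average
dinuc_shifts = {(25, "AC"): -1.5,    # AC context is worse than average
                (25, "TG"): +2.0}    # TG context is better than average

late_ac = recalibrated_quality(25, 76, "AC", global_shift,
                               qual_shifts, cycle_shifts, dinuc_shifts)
early_tg = recalibrated_quality(25, 2, "TG", global_shift,
                                qual_shifts, cycle_shifts, dinuc_shifts)
```

With these made-up tables, two bases that were both reported as Q25 end up with different recalibrated scores, which is the dispersion effect described in the introduction.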

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high-quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

Miscellaneous information

∙ The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags), and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so files with many read groups could require a significant amount of RAM to store all of the covariate data.

∙ A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yield significantly better results.

∙ Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches, even in cancer, will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.

∙ The recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from #mismatches / #bases, we actually infer it from (#mismatches + 1) / (#bases + 2). This deals very nicely with overfitting problems: it has only a minor impact on data sets with billions of bases but is critical for avoiding overconfidence in rare bins in sparse data.
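The effect of the (#mismatches + 1) / (#bases + 2) correction in the last item can be checked numerically. The Python sketch below compares the raw and corrected estimates for a sparse bin and a well-populated bin; the function name and the example counts are assumptions chosen for illustration.

```python
import math

def empirical_quality(mismatches, bases, corrected=True):
    """Phred-scaled empirical quality of a bin, with or without the correction."""
    if corrected:
        p = (mismatches + 1) / (bases + 2)
    else:
        p = mismatches / bases
    if p == 0:
        return math.inf  # zero observed errors implies unbounded confidence
    return -10 * math.log10(p)

# Sparse bin: 10 bases, no mismatches.
sparse_raw = empirical_quality(0, 10, corrected=False)   # infinite quality
sparse_corrected = empirical_quality(0, 10)               # a modest score near Q11

# Well-populated bin: one billion bases, one million mismatches.
dense_raw = empirical_quality(1_000_000, 1_000_000_000, corrected=False)
dense_corrected = empirical_quality(1_000_000, 1_000_000_000)

print(sparse_raw, round(sparse_corrected, 2))
print(round(dense_raw, 4), round(dense_corrected, 4))
```

The sparse bin no longer claims perfect quality from ten error-free observations, while the estimate for the dense bin is essentially unchanged.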

Example pre and post recalibration results

∙ Recalibration of a lane sequenced at the Broad
