METHODS

GATK-Methods

---- DNAseq

DNAseq OVERVIEW

Pre-processing

Local Realignment around Indels

Base Quality Score Recalibration

Base Quality Score Recalibration (BQSR)

Variant Discovery Overview

Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode

How the HaplotypeCaller's reference confidence model works

Suggested Preliminary Analyses

Variant Filtering with VQSR

Variant Quality Score Recalibration (VQSR)

Genotype Refinement

Which tools use pedigree information?

Purpose and operation of Read-backed Phasing

Functional Annotation

Adding Genomic Annotations Using SnpEff and VariantAnnotator

Where can I get a gene list in RefSeq format?

Variant Evaluation

Selecting variants of interest from a callset

Combining variants from different files into one

Using VariantEval

DNAseq OVERVIEW

Pre-processing

Local Realignment around Indels

RealignerTargetCreator / IndelRealigner

For a complete, detailed argument reference, refer to the GATK documentation pages for RealignerTargetCreator and IndelRealigner.

Running the IndelRealigner only at known sites

While we advocate for using the IndelRealigner over an aggregated BAM using the full Smith-Waterman alignment algorithm, it will work for just a single lane of sequencing data when run in -knownsOnly mode. Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained. We modify the example arguments from above to reflect the command lines necessary for known-only/lane-level cleaning.

The RealignerTargetCreator step needs to be done just once for a single set of indels; as long as the set of known indels doesn't change, the output.intervals file from below never needs to be recalculated.

java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R /path/to/reference.fasta \
    -o /path/to/output.intervals \
    -known /path/to/indel_calls.vcf

The IndelRealigner step needs to be run on every BAM file.

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
    -jar /path/to/GenomeAnalysisTK.jar \
    -I \
    -R \
    -T IndelRealigner \
    -targetIntervals \
    -o \
    -known /path/to/indel_calls.vcf \
    --consensusDeterminationModel KNOWNS_ONLY \
    -LOD 0.4

Base Quality Score Recalibration

Base Quality Score Recalibration (BQSR)

Detailed information about command line options for BaseRecalibrator can be found in the GATK tool documentation for BaseRecalibrator.

Introduction

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field of each read in the output BAM are more accurate, in that the reported quality score is closer to the actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular, these new probabilities fit very naturally as the gap penalties in HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.

This process is accomplished by analyzing the covariation among several features of a base. For example:

∙ Reported quality score

∙ The position within the read

∙ The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, before recalibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so they are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often of much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases by separating the low-quality AC bases at the end of the read from the high-quality TG bases at the start of the read. See below for examples of pre- and post-correction values.
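
The Q25 versus Q20 arithmetic in this example follows from the standard Phred definition, Q = -10 log10(p), where p is the probability that the base call is wrong. The Python sketch below only illustrates that conversion; it is not part of GATK, and the function names are hypothetical.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)

reported_q = 25
observed_mismatch_rate = 1 / 100  # empirical rate from the example in the text

claimed_rate = phred_to_error_prob(reported_q)             # about 1 error in 316 bases
empirical_q = error_prob_to_phred(observed_mismatch_rate)  # Q20

print(f"Q{reported_q} claims 1 error in {1 / claimed_rate:.0f} bases")
print(f"observed mismatch rate corresponds to Q{empirical_q:.0f}")
```

A reported Q25 therefore claims roughly three times fewer errors than the 1 in 100 rate actually observed, which is the overconfidence that recalibration removes.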

The system was designed so that users can easily add new covariates to the calculations. Users wishing to add their own covariate can look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to define a getValue method which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.

Running the tools

BaseRecalibrator

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

∙ read group the read belongs to

∙ assigned quality score

∙ machine cycle producing this base

∙ current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
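
The counting described above can be sketched in a few lines of Python. This is a simplified illustration and not BaseRecalibrator's implementation: the record layout (a dict per base), the field names, and the use of a single joint key over all four features are assumptions made for the example.

```python
from collections import defaultdict

def tabulate(bases, known_sites):
    """Count observations and reference mismatches per covariate bin.

    bases: iterable of dicts describing one sequenced base each (hypothetical layout).
    known_sites: set of (contig, position) pairs known to vary, e.g. from dbSNP.
    Returns {bin_key: [n_observations, n_mismatches]}.
    """
    counts = defaultdict(lambda: [0, 0])
    for b in bases:
        # Skip loci known to vary: a mismatch there may be real variation.
        if (b["contig"], b["pos"]) in known_sites:
            continue
        key = (b["read_group"], b["reported_q"], b["cycle"], b["dinucleotide"])
        counts[key][0] += 1
        if b["base"] != b["ref_base"]:
            counts[key][1] += 1
    return dict(counts)

# Toy input: three bases in the same bin, one of them at a known variant site.
common = {"read_group": "rg1", "reported_q": 25, "cycle": 1,
          "dinucleotide": "TG", "contig": "chr1", "base": "G"}
bases = [
    {**common, "pos": 100, "ref_base": "G"},  # matches the reference
    {**common, "pos": 101, "ref_base": "A"},  # mismatch, counted as an error
    {**common, "pos": 102, "ref_base": "A"},  # mismatch at a known site, skipped
]
table = tabulate(bases, known_sites={("chr1", 102)})
print(table)
```

In this toy input the bin ends up with two observations and one mismatch, because the third base falls on a known variant site and is excluded from the error model.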

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine's on-the-fly recalibration capability. Here is a typical command line to do so:

java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fasta \
    -I input.bam \
    -BQSR recalibration_report.grp \
    -o output.bam

After computing covariates in the initial BAM file, we walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, writing the result into a new BAM file.

This step uses the recalibration table data in recalibration_report.grp, produced by BaseRecalibrator, to recalibrate the quality scores in input.bam, and writes out a new BAM file, output.bam, with recalibrated QUAL field values.

Effectively the new quality score is:

∙ the sum of the global difference between reported quality scores and the empirical quality

∙ plus the quality-bin-specific shift

∙ plus the cycle x qual and dinucleotide x qual effects
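
As a rough illustration of this additive structure, the sketch below applies the listed shifts to a reported quality score. It assumes the shifts are added to the reported score, and every number in it is invented for the example; in practice the shifts come from the recalibration table, and the function name is hypothetical.

```python
def recalibrated_quality(reported_q, global_shift, quality_bin_shift,
                         cycle_shift, dinucleotide_shift):
    """Apply the additive corrections to a reported quality score.

    Each shift is (empirical quality - expected quality) at one level of
    the hierarchy; negative values mean the bases were overconfident.
    """
    return (reported_q + global_shift + quality_bin_shift
            + cycle_shift + dinucleotide_shift)

# Invented numbers: a Q25 base in a read group that is 5 units overconfident
# overall, late in the read (cycle penalty), in a favourable dinucleotide context.
new_q = recalibrated_quality(
    reported_q=25,
    global_shift=-5.0,
    quality_bin_shift=0.5,
    cycle_shift=-1.5,
    dinucleotide_shift=1.0,
)
print(new_q)
```

Keeping the terms separate means that two bases with the same reported score can end up with different recalibrated scores when their cycle or sequence context differs.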

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high-quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

Miscellaneous information

∙ The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags), and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so files with many read groups could require a significant amount of RAM to store all of the covariate data.

∙ A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yield significantly better results.

∙ Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches, even in cancer, will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.

∙ The recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from #mismatches / #bases, we actually infer it from (#mismatches + 1) / (#bases + 2). This handles overfitting well: the correction has only a minor impact on data sets with billions of bases, but it is critical to avoid overconfidence in rare bins in sparse data.
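
The effect of the low-occupancy correction in the last point can be seen with a small Python sketch. It is illustrative only: the function name is hypothetical and the counts are invented, but the formula is the (#mismatches + 1) / (#bases + 2) estimate described above.

```python
import math

def smoothed_empirical_quality(mismatches, bases):
    """Phred-scaled quality from the smoothed estimate (mismatches + 1) / (bases + 2)."""
    error_rate = (mismatches + 1) / (bases + 2)
    return -10 * math.log10(error_rate)

# Sparse bin: 0 mismatches in 8 bases. The raw estimate 0 / 8 would imply
# an error rate of zero (infinite quality); the smoothed estimate is 1 / 10.
sparse_q = smoothed_empirical_quality(0, 8)

# Well-populated bin: the correction barely moves the raw 1-in-100 estimate.
dense_q = smoothed_empirical_quality(10_000_000, 1_000_000_000)

print(f"sparse bin: Q{sparse_q:.1f}")
print(f"dense bin:  Q{dense_q:.1f}")
```

The sparse bin is capped at about Q10 instead of being treated as error-free, while the dense bin stays at essentially Q20.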

Example pre- and post-recalibration results

∙ Recalibration of a lane sequenced at the Broad
