METHODS
GATK-Methods
----DNAseq
DNAseq OVERVIEW
Pre-processing
Local Realignment around Indels
Base Quality Score Recalibration
Base Quality Score Recalibration (BQSR)
Variant Discovery Overview
Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode
How the HaplotypeCaller's reference confidence model works
Suggested Preliminary Analyses
Variant Filtering with VQSR
Variant Quality Score Recalibration (VQSR)
Genotype Refinement
Which tools use pedigree information?
Purpose and operation of Read-backed Phasing
Functional Annotation
Adding Genomic Annotations Using SnpEff and VariantAnnotator
Where can I get a gene list in RefSeq format?
Variant Evaluation
Selecting variants of interest from a callset
Combining variants from different files into one
Using VariantEval
DNAseq OVERVIEW
Pre-processing
Local Realignment around Indels
RealignerTargetCreator/IndelRealigner
For a complete, detailed argument reference, refer to the GATK documentation pages for RealignerTargetCreator and IndelRealigner.
Running the Indel Realigner only at known sites
While we advocate running the Indel Realigner over an aggregated BAM using the full Smith-Waterman alignment algorithm, it will also work on a single lane of sequencing data when run in knowns-only mode (--consensusDeterminationModel KNOWNS_ONLY in the command below). Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained. The commands below show the arguments needed for known-only/lane-level cleaning.
The RealignerTargetCreator step needs to be done just once for a given set of indels; as long as the set of known indels doesn't change, the output.intervals file produced below never needs to be recalculated.
java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals \
  -known /path/to/indel_calls.vcf
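The IndelRealigner step needs to be run on every BAM file. Supply the lane-level BAM after -I, the reference FASTA after -R, the intervals file from the previous step after -targetIntervals, and the output BAM path after -o.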
java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I \
  -R \
  -T IndelRealigner \
  -targetIntervals \
  -o \
  -known /path/to/indel_calls.vcf \
  --consensusDeterminationModel KNOWNS_ONLY \
  -LOD 0.4
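Because the IndelRealigner step is repeated for every BAM file, it can help to generate the command lines programmatically. The following is a minimal sketch, not part of GATK: the function name, the lane file names and the "sample.bam -> sample.realigned.bam" output naming are hypothetical, and only the flags shown in the commands above are used.

```python
from pathlib import Path


def indel_realigner_command(bam, reference, intervals, known_vcf,
                            gatk_jar="/path/to/GenomeAnalysisTK.jar",
                            tmpdir="/path/to/tmpdir"):
    """Return the argument list for one knowns-only IndelRealigner run."""
    bam = Path(bam)
    # Hypothetical naming convention: sample.bam -> sample.realigned.bam
    out_bam = bam.with_name(bam.stem + ".realigned.bam")
    return [
        "java", "-Xmx4g", f"-Djava.io.tmpdir={tmpdir}",
        "-jar", gatk_jar,
        "-T", "IndelRealigner",
        "-I", str(bam),
        "-R", reference,
        "-targetIntervals", intervals,
        "-o", str(out_bam),
        "-known", known_vcf,
        "--consensusDeterminationModel", "KNOWNS_ONLY",
        "-LOD", "0.4",
    ]


# One command per lane-level BAM; the intervals file is computed only once.
commands = [
    indel_realigner_command(bam,
                            "/path/to/reference.fasta",
                            "/path/to/output.intervals",
                            "/path/to/indel_calls.vcf")
    for bam in ["lane1.bam", "lane2.bam"]  # hypothetical lane files
]

for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) to launch the job; printing the commands first makes it easy to review them before running anything.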
Base Quality Score Recalibration
Base Quality Score Recalibration (BQSR)
Detailed information about command line options for BaseRecalibrator can be found in the GATK tool documentation.
Introduction
The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field of each read in the output BAM are more accurate, in that the reported quality score is closer to the actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.
New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular, these new probabilities fit very naturally as the gap penalties in HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.
This process is accomplished by analyzing the covariation among several features of a base. For example:
∙ Reported quality score
∙ The position within the read
∙ The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine
These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.
For example, before recalibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so they are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur more frequently at the end of the reads than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often of much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases by separating the low-quality end-of-read AC bases from the high-quality TG bases at the start of the read. See below for examples of pre- and post-correction values.
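The Q25 versus Q20 comparison relies on the standard Phred relationship Q = -10 * log10(P), where P is the probability that the base call is wrong. A minimal sketch of the conversion in both directions; the function names are our own and are not part of GATK:

```python
import math


def phred_to_error(q):
    """Error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred-scaled quality score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


reported_q = 25            # quality claimed by the sequencer
observed_rate = 1 / 100    # mismatch rate actually observed against the reference

print("error rate implied by the reported quality:", phred_to_error(reported_q))
print("empirical quality from the observed rate:", error_to_phred(observed_rate))
```

A reported Q25 implies roughly one error in 316 bases, so a file whose bases really mismatch once in 100 is overstating its quality by about 5 Phred units.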
The system was designed so that users can easily add new covariates to the calculations. Users wishing to add their own covariate can look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to define a getValue method which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.
Running the tools
BaseRecalibrator
This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:
∙ read group the read belongs to
∙ assigned quality score
∙ machine cycle producing this base
∙ current base + previous base (dinucleotide)
For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
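The tabulation described above can be pictured as a counting pass over per-base observations. This is a toy sketch only: the observation dictionaries, their field names and the single combined bin key are hypothetical simplifications and do not reflect how BaseRecalibrator stores its tables or the layout of the .grp report.

```python
from collections import defaultdict


def tabulate(observations):
    """Count bases and mismatches per (read group, quality, cycle, dinucleotide) bin."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [bases, mismatches]
    for obs in observations:
        # Loci known to vary in the population are excluded, so real
        # variants are not counted as sequencing errors.
        if obs["known_variant_site"]:
            continue
        key = (obs["read_group"], obs["quality"], obs["cycle"], obs["dinuc"])
        counts[key][0] += 1
        counts[key][1] += int(obs["mismatch"])
    return dict(counts)


# Hypothetical observations, invented for illustration.
observations = [
    {"read_group": "rg1", "quality": 25, "cycle": 1, "dinuc": "TG",
     "mismatch": False, "known_variant_site": False},
    {"read_group": "rg1", "quality": 25, "cycle": 1, "dinuc": "TG",
     "mismatch": True, "known_variant_site": False},
    {"read_group": "rg1", "quality": 25, "cycle": 1, "dinuc": "TG",
     "mismatch": True, "known_variant_site": True},   # skipped: known site
    {"read_group": "rg1", "quality": 25, "cycle": 76, "dinuc": "AC",
     "mismatch": True, "known_variant_site": False},
]

table = tabulate(observations)
for bin_key, (bases, mismatches) in sorted(table.items()):
    print(bin_key, "bases:", bases, "mismatches:", mismatches)
```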
Creating a recalibrated BAM
To create a recalibrated BAM you can use GATK's PrintReads with the engine's on-the-fly recalibration capability. Here is a typical command line to do so:
java -jar GenomeAnalysisTK.jar \
  -T PrintReads \
  -R reference.fasta \
  -I input.bam \
  -BQSR recalibration_report.grp \
  -o output.bam
After the covariates have been computed from the initial BAM file, this step walks through the BAM file again. It uses the recalibration table in recalibration_report.grp, produced by BaseRecalibrator, to recalibrate the quality scores in input.bam, and writes out a new BAM file, output.bam, with recalibrated QUAL field values.
Effectively the new quality score is:
∙ the sum of the global difference between reported quality scores and the empirical quality
∙ plus the quality bin specific shift
∙ plus the cycle x qual and dinucleotide x qual effect
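A minimal sketch of how the additive shifts listed above could combine, assuming each shift is added to the reported quality score and looked up from simple tables. The table layout, key names and every numeric value here are hypothetical; they illustrate the arithmetic only and are not the GATK implementation or report format.

```python
def recalibrated_quality(reported_q, cycle, dinuc, shifts):
    """Add the global, per-quality, cycle and dinucleotide shifts to a reported Q."""
    delta = shifts["global"]
    # Bins that were never observed contribute no shift.
    delta += shifts["by_quality"].get(reported_q, 0.0)
    delta += shifts["by_cycle"].get((reported_q, cycle), 0.0)
    delta += shifts["by_dinuc"].get((reported_q, dinuc), 0.0)
    return reported_q + delta


# Hypothetical shifts for one read group (all values invented for illustration).
shifts = {
    "global": -4.0,
    "by_quality": {25: 0.5},
    "by_cycle": {(25, 76): -1.5, (25, 1): 1.0},
    "by_dinuc": {(25, "AC"): -0.5, (25, "TG"): 0.25},
}

late_ac = recalibrated_quality(25, cycle=76, dinuc="AC", shifts=shifts)
early_tg = recalibrated_quality(25, cycle=1, dinuc="TG", shifts=shifts)
print("end-of-read AC base:", late_ac)
print("start-of-read TG base:", early_tg)
```

With these invented numbers, two bases that were both reported as Q25 end up with different recalibrated scores, which is the "more widely dispersed" behaviour described in the introduction.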
Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
Miscellaneous information
∙ The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so files with many read groups could require a significant amount of RAM to store all of the covariate data.
∙ A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yield significantly better results.
∙ Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches, even in cancer, will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
∙ The recalibrator applies a "yates" correction for low occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting problems: it has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.
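The effect of the (# mismatches + 1) / (# bases + 2) correction is easiest to see numerically. A small sketch using the formula as stated above; the function name and the example counts are our own:

```python
import math


def empirical_quality(mismatches, bases):
    """Phred-scaled quality from the corrected rate (mismatches + 1) / (bases + 2)."""
    rate = (mismatches + 1) / (bases + 2)
    return -10 * math.log10(rate)


# Sparse bin: 0 mismatches in 8 bases. The uncorrected rate 0/8 would imply
# an infinite quality; the corrected rate is 1/10.
sparse_q = empirical_quality(0, 8)

# Well-populated bin: 10 million mismatches in 1 billion bases. The
# uncorrected rate is exactly 1 in 100 and the correction barely moves it.
dense_q = empirical_quality(10_000_000, 1_000_000_000)

print("sparse bin:", sparse_q)
print("dense bin:", dense_q)
```

An empty bin comes out at a rate of 1/2, so bins with no evidence are treated as uninformative rather than as perfect.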
Example pre and post recalibration results
∙ Recalibration of a lane sequenced at the Broad