Weight Matrices for Sequence Similarity Scoring

Version 2.0

May 1996

David Wheeler, Ph.D.

Department of Cell Biology,

Baylor College of Medicine

Houston, Texas

E-mail: wheeler@bcm.tmc.edu

Table of Contents

1. Weight matrices for sequence similarity scoring

2. Importance of scoring matrices

3. Examples of matrices

4. Log odds matrices

5. PAM matrix construction

6. Properties of the PAM matrix

7. Assumptions in the PAM model

8. BLOSUM (Blocks Substitution Matrix) matrix

9. Practical aspects

10. Selecting an optimal PAM matrix

11. Other specialized scoring matrices

Weight Matrices for Sequence Similarity Scoring

Outline:

1. Objective: Overview of methods and theories that underlie the construction of scoring matrices.

2. Examples of weight matrices for nucleotide and amino acid scoring.

3. Transition probability matrix: PAM

o Construction

o Properties

o Sources of error

4. BLOSUM matrix

o Construction

o Sources of error

5. Practical aspects

6. Other refinements to transition probability matrices.

Reading:

∙ D. G. George, W. C. Barker and L. T. Hunt (1990). Mutation Data Matrix and Its Uses. In Methods in Enzymology, vol. 183; R. F. Doolittle, ed., pp. 333-351. Academic Press, Inc., New York.

∙ M. O. Dayhoff (1978). Atlas of Protein Sequence and Structure (Natl. Biomed. Res. Found., Washington), Vol. 5, Suppl. 3, pp. 345-352.

∙ S. F. Altschul (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219: 555-565.

∙ S. F. Altschul, M. S. Boguski, W. Gish and J. C. Wootton (1994). Issues in searching molecular sequence databases. Nature Genetics 6: 119-129.


Importance of scoring matrices

∙ Scoring matrices appear in all analyses involving sequence comparison.

∙ The choice of matrix can strongly influence the outcome of the analysis.

∙ Scoring matrices implicitly represent a particular theory of evolution.

∙ Understanding the theories underlying a given scoring matrix can aid in making the proper choice.

Similarity vs. Distance

1. Elements of the matrices specify the weight to assign a given comparison by:

o the cost of replacing one residue with another (distance); or

o a measure of the similarity for the replacement.

2. Distance is more naturally used for phylogenetic tree reconstruction; similarity is used for database searching.

3. The logic of the algorithm doesn't change: maximizing a similarity is fundamentally the same as minimizing a distance.

4. Distance and similarity matrices are inter-convertible by some mathematical transformation appropriate for the given application.
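The notes do not prescribe a particular transformation between similarity and distance. As one illustration only, the sketch below subtracts every similarity from the largest score in the matrix, so the best match costs 0. The function name `similarity_to_distance` and the dict-of-dicts layout are assumptions made for this example.

```python
def similarity_to_distance(sim):
    """Convert a similarity matrix (dict of dicts) into a distance matrix.

    Assumed transformation: d(a, b) = max_score - s(a, b), so the
    highest-scoring pair costs 0 and every other pair costs more.
    """
    max_score = max(score for row in sim.values() for score in row.values())
    return {a: {b: max_score - s for b, s in row.items()} for a, row in sim.items()}


# A BLAST-style nucleotide similarity matrix: +5 for a match, -4 for a mismatch.
BASES = "ATCG"
blast_sim = {a: {b: (5 if a == b else -4) for b in BASES} for a in BASES}
blast_dist = similarity_to_distance(blast_sim)
```

With this choice, an ungapped alignment of n positions has total distance equal to n times the maximum score minus its total similarity, which is why maximizing one is the same as minimizing the other in that setting.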


Examples of matrices

A remark on notation

When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix. That is, $M_{1,1}$ refers to the entry at the first row and the first column. In general, $M_{i,j}$ refers to the entry at the ith row and the jth column. To use this for sequence alignment, we simply associate a numeric value to each letter in the alphabet of the sequence. For example, if the alphabet is {A, C, G, T} then A=1, C=2, etc. Thus, one would find the score for a match between A and C at $M_{1,2}$. Since we consider different scoring matrices in this section, we distinguish between them by using different letters for the matrix: one for the Replacement matrix, $S_{ij}$ for the log odds matrix, and so on.

Nucleotide scoring

1. Identity matrix (similarity)

        A    T    C    G
   A    1    0    0    0
   T    0    1    0    0
   C    0    0    1    0
   G    0    0    0    1

For elements in row i by column j: $s_{ij} = 1$ if $i = j$, and $s_{ij} = 0$ otherwise.

2. BLAST matrix (similarity)

        A    T    C    G
   A    5   -4   -4   -4
   T   -4    5   -4   -4
   C   -4   -4    5   -4
   G   -4   -4   -4    5

3. Transition/Transversion Matrix

        A    T    C    G
   A    0    5    5    1
   T    5    0    1    5
   C    5    1    0    5
   G    1    5    5    0

Nucleotide bases fall into two categories depending on the ring structure of the base. Purines (Adenine and Guanine) are two-ring bases; pyrimidines (Cytosine and Thymine) are single-ring bases. Mutations in DNA are changes in which one base is replaced by another. A mutation that conserves the ring number is called a transition (e.g., A->G or C->T); a mutation that changes the ring number is called a transversion (e.g., A->C or A->T and so on).

Although there are more ways to create a transversion, the number of transitions observed to occur in nature (i.e., when comparing related DNA sequences) is much greater. Since the likelihood of transitions is greater, it is sometimes desirable to create a weight matrix which takes this propensity into account when comparing two DNA sequences.

Use of a Transition/Transversion Matrix reduces noise in comparisons of distantly related sequences.
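As a minimal sketch of how the three nucleotide matrices above are applied, the code below builds each one and sums its entries over an ungapped alignment. The helper names (`ts_tv_cost`, `score_ungapped`) are hypothetical. The transition/transversion entries are treated as costs here, since that matrix has 0 on its diagonal.

```python
BASES = "ATCG"
PURINES = {"A", "G"}  # two-ring bases; C and T are the single-ring pyrimidines

# Identity matrix (similarity): 1 on the diagonal, 0 elsewhere.
IDENTITY = {a: {b: int(a == b) for b in BASES} for a in BASES}

# BLAST matrix (similarity): +5 for a match, -4 for a mismatch.
BLAST = {a: {b: (5 if a == b else -4) for b in BASES} for a in BASES}


def ts_tv_cost(a, b):
    """Return 0 for identical bases, 1 for a transition, 5 for a transversion."""
    if a == b:
        return 0
    same_ring_class = (a in PURINES) == (b in PURINES)
    return 1 if same_ring_class else 5


# Transition/transversion matrix, matching the table in the text.
TS_TV = {a: {b: ts_tv_cost(a, b) for b in BASES} for a in BASES}


def score_ungapped(seq1, seq2, matrix):
    """Sum matrix entries over aligned positions of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[a][b] for a, b in zip(seq1, seq2))
```

For example, aligning ACGT with GCGA gives one transition (A/G), two matches, and one transversion (T/A): the identity score is 2, the BLAST score is 2, and the transition/transversion cost is 6.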

Protein scoring

1. Identity matrix

2. Genetic Code Matrix

o Score based on minimum number of base changes required to convert one amino acid into another.

Distance matrix

3. Physical/chemical characteristics

o Attempt to quantify some physical or chemical attribute of the residues and arbitrarily assign weights based on similarities of the residues in this chosen property.

Hydrophobicity matrix
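The genetic code matrix described above can be sketched as follows, assuming the standard genetic code. For each pair of amino acids, the code takes the minimum number of differing bases over all pairs of their codons. The function name `min_base_changes` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...);
# '*' marks the stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (one-letter code) to the list of codons that encode it.
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(codon))


def min_base_changes(aa1, aa2):
    """Fewest base changes that turn some codon for aa1 into some codon for aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )
```

Methionine (ATG) and isoleucine (ATT, ATC, ATA) are one base change apart, whereas methionine and tyrosine (TAT, TAC) differ at all three positions.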


Log odds matrices

S is the log odds ratio of two probabilities: the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance.

$$S_{ij} = \log \frac{q_{ij}}{p_i \, p_j}$$

∙ $q_{ij}$ are the frequencies with which residues i and j are observed to align in sequences known to be related. They are derived from a "transition probability matrix."

∙ $p_i$ and $p_j$ are the frequencies of occurrence of residues i and j in the set of sequences.

∙ e.g., PAM250, BLOSUM62, and others.
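A small numeric sketch of a single log odds entry may help. Here `q_ij` stands for the observed frequency of the aligned pair and `p_i`, `p_j` for the background frequencies of the two residues. These names, the function `log_odds`, and the choice of base 2 are assumptions for illustration; the notes do not fix a logarithm base.

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0):
    """Log of the observed aligned-pair frequency over the chance expectation."""
    return math.log(q_ij / (p_i * p_j), base)


# Two residues that each make up 5% of the data are expected to align by
# chance with frequency 0.05 * 0.05 = 0.0025.
favoured = log_odds(0.01, 0.05, 0.05)       # seen 4x more often than chance
disfavoured = log_odds(0.00125, 0.05, 0.05)  # seen half as often as chance
```

A pair seen more often than chance gets a positive score, and a pair seen less often gets a negative one. Because the scores are logarithms, adding them along an alignment corresponds to multiplying the underlying odds ratios.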

PAM Matrix

Summary of steps:

1. Align sequences that are at least 85% identical.

o minimize ambiguity in alignments.

o minimize the number of coincident mutations.

2. Reconstruct phylogenetic trees and infer ancestral sequences. 71 trees containing 1,572 exchanges were used.

3. Tally replacements "accepted" by natural selection, in all pair-wise comparisons (each $A_{ij}$ is the number of times amino acid j was replaced by amino acid i in all comparisons).

4. Compute amino acid mutability, $m_j$, i.e., the propensity of a given amino acid, j, to be replaced.

5. Combine data from 3 & 4 to produce a Mutation Probability Matrix for one PAM of evolutionary distance, according to the following formulae, where $\lambda$ is a constant of proportionality:

$$M_{ij} = \frac{\lambda \, m_j \, A_{ij}}{\sum_i A_{ij}} \quad (i \neq j), \qquad M_{jj} = 1 - \lambda \, m_j$$

6. Calculate Log Odds Matrix for similarity scoring:

o Divide each element of the Mutation Probability Matrix, M, by the frequency of occurrence of each residue:

$$R_{ij} = \frac{M_{ij}}{f_i}$$

R is a Relatedness Odds Matrix, $f_i$ is the frequency of residue i.

o The Log Odds Matrix, $S$, is calculated from the relatedness odds matrix, $R$, simply by taking the log of each element:

$$S_{ij} = \log R_{ij}$$

Different protein families manifest different PAM rates.
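The construction steps can be sketched on a toy three-letter alphabet. Everything specific here is an assumption made for illustration: the replacement tallies, the frequencies, the simplified mutability (changes divided by frequency), the base-10 logarithm, and the function names `pam1_from_counts` and `log_odds_matrix`. The sketch also assumes every pair was observed at least once, since a zero tally would give a log of zero.

```python
import math


def pam1_from_counts(counts, freqs, target=0.01):
    """Build a 1 PAM mutation probability matrix M, with M[i][j] = P(j -> i).

    counts[i][j]: tallied accepted replacements between residues i and j.
    freqs[j]:     frequency of residue j in the data set.
    target:       fraction of residues that should change in one step (1%).
    """
    letters = list(freqs)
    changes = {j: sum(counts[i][j] for i in letters if i != j) for j in letters}
    # Simplified relative mutability (assumption): changes per unit of occurrence.
    mutability = {j: changes[j] / freqs[j] for j in letters}
    # Choose lambda so that, on average, `target` of all residues change.
    lam = target / sum(freqs[j] * mutability[j] for j in letters)
    M = {i: {} for i in letters}
    for j in letters:
        for i in letters:
            if i == j:
                M[i][j] = 1.0 - lam * mutability[j]
            else:
                M[i][j] = lam * mutability[j] * counts[i][j] / changes[j]
    return M


def log_odds_matrix(M, freqs):
    """Relatedness odds R[i][j] = M[i][j] / f_i, then S[i][j] = log10 R[i][j]."""
    return {i: {j: math.log10(M[i][j] / freqs[i]) for j in M[i]} for i in M}


# Hypothetical toy data: symmetric replacement tallies and residue frequencies.
LETTERS = "ABC"
PAIR_COUNTS = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 20}
counts = {i: {j: 0 for j in LETTERS} for i in LETTERS}
for (a, b), n in PAIR_COUNTS.items():
    counts[a][b] = counts[b][a] = n
freqs = {"A": 0.5, "B": 0.3, "C": 0.2}

M = pam1_from_counts(counts, freqs)
S = log_odds_matrix(M, freqs)
```

In this toy setting each column of M sums to one, 1% of residues change per step on average, and the resulting log odds matrix is symmetric even though M itself is not.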


Properties of Mutation Probability Matrix

1. The sum of $M_{ij}$ for any column, j, is one (trivial). Note that the probability that an amino acid will change is on the order of 1% for each amino acid. The probability that it will stay the same is on the order of 99% for each amino acid.

2. The Mutation Probability Matrix, M1, defines a unit of evolutionary change: specifically, 1 PAM (Accepted Point Mutation per 100 residues).

o the matrix can be used to simulate evolution by using a random number generator to select the fate of each residue in the sequence according to the probability given in the table.

o exposing a 100 residue protein sequence of average composition to the evolutionary change represented by M1 results in one amino acid change, on average.

3. Successive application of M1 on a sequence yields 2, 3, 4... PAMs of evolutionary change.

4. The following operations are equivalent:

o successive application of M1 on a sequence.

o matrix multiplication of M1 by itself, M1*M1, followed by operation on a sequence.

o scaling the elements of M1 by a constant of proportionality, $\lambda$ = 1, 2, 3... according to the formulae below (with $A_{ij}$ the replacement tallies and $m_j$ the mutabilities), followed by operation on a sequence:

$$M_{ij} = \frac{\lambda \, m_j \, A_{ij}}{\sum_i A_{ij}} \quad (i \neq j), \qquad M_{jj} = 1 - \lambda \, m_j$$

the above equation enables the direct calculation of a matrix for any desired PAM distance. Strictly, scaling by $\lambda$ matches repeated multiplication only approximately, and only for small PAM distances.

5. The matrix has compositional information in it, since it depends on the relative frequencies of amino acids in the pool of sequences from which the tallies were drawn. In the extremes, the following obtain:

o The elements of the 0 PAM matrix are 1 for $i = j$ and 0 for $i \neq j$;

o The $\infty$ PAM matrix elements approach the asymptotic amino acid composition.
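The properties listed above can be checked numerically on a toy matrix. The sketch below uses a hypothetical three-letter alphabet and a deliberately simple 1 PAM matrix in which every replacement lands on residue i with probability proportional to its frequency. That matrix, the frequencies, and the helper names `mat_mul` and `pam_n` are illustrative assumptions, not the Dayhoff data.

```python
def mat_mul(X, Y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(M1, n):
    """Return the n PAM matrix: M1 multiplied by itself n times (identity for n = 0)."""
    size = len(M1)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(M1, result)
    return result


# Toy composition and a rate chosen so that 1% of residues change per step.
FREQS = [0.5, 0.3, 0.2]
RATE = 0.01 / (1.0 - sum(f * f for f in FREQS))

# M1[i][j] is the probability that residue j is replaced by residue i,
# so each column sums to one.
M1 = [[(1.0 - RATE * (1.0 - FREQS[j])) if i == j else RATE * FREQS[i]
       for j in range(3)]
      for i in range(3)]

PAM0 = pam_n(M1, 0)        # identity: nothing has changed yet
PAM250 = pam_n(M1, 250)    # a commonly used evolutionary distance
PAM_FAR = pam_n(M1, 2000)  # stands in for the "infinite" PAM limit
```

In this toy model every column of `PAM250` still sums to one, and every column of `PAM_FAR` is numerically indistinguishable from `FREQS`, which is the asymptotic composition described in item 5.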


Assumptions in PAM model:

1. replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).

2. sequences that are being compared have average amino acid composition.
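The Markov assumption can be made concrete with a small simulation: at each step, the fate of every site is drawn from the column of the matrix that belongs to its current residue, independently of all other sites and of the site's history. The function name `evolve` and its argument layout are hypothetical choices for this sketch.

```python
import random


def evolve(sequence, M, alphabet, steps=1, rng=None):
    """Apply `steps` rounds of independent per-site replacement.

    M[i][j] is the probability that residue alphabet[j] becomes alphabet[i]
    in one step, so each column of M must sum to one.
    """
    rng = rng or random.Random()
    letters = list(alphabet)
    index = {letter: k for k, letter in enumerate(letters)}
    seq = list(sequence)
    for _ in range(steps):
        for pos, residue in enumerate(seq):
            j = index[residue]
            column = [M[i][j] for i in range(len(letters))]
            # Draw the new residue according to the column probabilities.
            seq[pos] = rng.choices(letters, weights=column, k=1)[0]
    return "".join(seq)
```

Passing a seeded generator, for example `evolve(seq, M, alphabet, steps=250, rng=random.Random(0))`, makes a run repeatable. Because each site is drawn independently, the simulation cannot reproduce conserved blocks, which is one of the limitations of the model.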

Sources of error in PAM model

1. Many sequences depart from average composition.

2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed!).

3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM.

4. The Markov process is an imperfect representation of evolution: Distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.


BLOSUM (Blocks Substitution Matrix) Matrix

Steven Henikoff and Jorja G. Henikoff (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915-10919.

1. Starting data are conserved blocks from the Blocks database.

o aligned, ungapped sequences

o widely varying similarity, but measures are taken to avoid biasing the sample with frequently occurring highly related sequences.

2. Tallies of replacements are made by straightforward tallying of all pairs of aligned residues,

o The obser
