Weight Matrices for Sequence Similarity Scoring
Version 2.0
May 1996
David Wheeler, Ph.D.
Department of Cell Biology,
Baylor College of Medicine
Houston, Texas
E-mail: wheeler@bcm.tmc.edu
Table of Contents
1. Weight matrices for sequence similarity scoring
2. Importance of scoring matrices
3. Examples of matrices
4. Log odds matrices
5. PAM matrix construction
6. Properties of the PAM matrix
7. Assumptions in the PAM model
8. BLOSUM (Blocks Substitution Matrix) matrix
9. Practical aspects
10. Selecting an optimal PAM matrix
11. Other specialized scoring matrices
Weight Matrices for Sequence Similarity Scoring
Outline:
1. Objective: Overview of methods and theories that underlie the construction of scoring matrices.
2. Examples of weight matrices for nucleotide and amino acid scoring.
3. Transition probability matrix: PAM
   o Construction
   o Properties
   o Sources of error
4. BLOSUM matrix
   o Construction
   o Sources of error
5. Practical aspects
6. Other refinements to transition probability matrices.
Reading:
∙ D. G. George, W. C. Barker and L. T. Hunt (1990). Mutation Data Matrix and Its Uses. In Methods in Enzymology, vol. 183; R. F. Doolittle, ed., pp. 333-351. Academic Press, Inc., New York.
∙ M. O. Dayhoff (1978). Atlas of Protein Sequence and Structure (Natl. Biomed. Res. Found., Washington), Vol. 5, Suppl. 3, pp. 345-352.
∙ S. F. Altschul (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219: 555-565.
∙ S. F. Altschul, M. S. Boguski, W. Gish and J. C. Wootton (1994). Issues in searching molecular sequence databases. Nature Genetics 6: 119-129.
Importance of scoring matrices
∙ Scoring matrices appear in every analysis involving sequence comparison.
∙ The choice of matrix can strongly influence the outcome of the analysis.
∙ Scoring matrices implicitly represent a particular theory of evolution.
∙ Understanding the theory underlying a given scoring matrix can aid in making a proper choice.
Similarity vs. Distance
1. Elements of the matrices specify the weight to assign to a given comparison by:
   o the cost of replacing one residue with another (distance); or
   o a measure of the similarity for the replacement.
2. Distance is more naturally used for phylogenetic tree reconstruction; similarity is used for database searching.
3. The logic of the algorithm doesn't change: maximizing a similarity is fundamentally the same as minimizing a distance.
4. Distance and similarity matrices are inter-convertible by some mathematical transformation appropriate for the given application.
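One possible transformation can be sketched as follows. This is only an illustration, assuming the simple rule "distance = largest similarity score minus similarity"; the function name and the toy +5/-4 matrix are hypothetical choices, and other applications may call for a different transformation.

```python
def similarity_to_distance(sim):
    """Convert a similarity matrix (nested dict) into a distance matrix.

    Assumed rule (one of many possible): d(a, b) = max_score - s(a, b).
    """
    max_score = max(score for row in sim.values() for score in row.values())
    return {a: {b: max_score - s for b, s in row.items()}
            for a, row in sim.items()}


# Toy nucleotide similarity matrix: +5 for a match, -4 for a mismatch.
alphabet = "ATCG"
similarity = {a: {b: (5 if a == b else -4) for b in alphabet} for a in alphabet}
distance = similarity_to_distance(similarity)

print(distance["A"]["A"], distance["A"]["T"])
```

Under this rule, for two ungapped sequences of equal length n, the total distance equals n times the maximum score minus the total similarity, so the alignment that maximizes one minimizes the other.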
Examples of matrices
A remark on notation
When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix. That is, $M_{1,1}$ refers to the entry at the first row and the first column. In general, $M_{ij}$ refers to the entry at the ith row and the jth column. To use this for sequence alignment, we simply associate a numeric value with each letter in the alphabet of the sequence. For example, if the alphabet is {A, C, G, T}, then A=1, C=2, etc. Thus, one would find the score for a match between A and C at $M_{1,2}$. Since we consider different scoring matrices in this section, we distinguish between them by using different letters for the matrix: $R_{ij}$ refers to the Replacement matrix, $S_{ij}$ to the log odds matrix, and so on.
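The letter-to-index convention can be sketched in a few lines. This is a minimal illustration, assuming the alphabet order A, C, G, T; the names ALPHABET, INDEX and entry, and the placeholder matrix M, are hypothetical.

```python
ALPHABET = "ACGT"  # assumed ordering, so that A=1, C=2, G=3, T=4
INDEX = {letter: k for k, letter in enumerate(ALPHABET, start=1)}


def entry(matrix, a, b):
    """Return the matrix entry for letters a and b using 1-based (row, column) indices."""
    i, j = INDEX[a], INDEX[b]
    return matrix[i - 1][j - 1]  # Python lists are 0-based, hence the shift


# Placeholder matrix whose entries encode their own position as 10*row + column.
M = [[10 * i + j for j in range(1, 5)] for i in range(1, 5)]

print(INDEX["A"], INDEX["C"])
print(entry(M, "A", "C"))  # row 1, column 2
```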
Nucleotide scoring
1. Identity matrix (similarity)

        A   T   C   G
    A   1   0   0   0
    T   0   1   0   0
    C   0   0   1   0
    G   0   0   0   1

   For elements in row i by column j, the entry is 1 if i = j and 0 otherwise.
2. BLAST matrix (similarity)

        A   T   C   G
    A   5  -4  -4  -4
    T  -4   5  -4  -4
    C  -4  -4   5  -4
    G  -4  -4  -4   5

3. Transition/Transversion Matrix

        A   T   C   G
    A   0   5   5   1
    T   5   0   1   5
    C   5   1   0   5
    G   1   5   5   0
Nucleotide bases fall into two categories depending on the ring structure of the base. Purines (Adenine and Guanine) are two-ring bases; pyrimidines (Cytosine and Thymine) are single-ring bases. Mutations in DNA are changes in which one base is replaced by another. A mutation that conserves the ring number is called a transition (e.g., A->G or C->T); a mutation that changes the ring number is called a transversion (e.g., A->C or A->T, and so on).
Although there are more ways to create a transversion, the number of transitions observed to occur in nature (i.e., when comparing related DNA sequences) is much greater. Since the likelihood of transitions is greater, it is sometimes desirable to create a weight matrix which takes this propensity into account when comparing two DNA sequences.
Use of a Transition/Transversion Matrix reduces noise in comparisons of distantly related sequences.
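The transition/transversion weights can be generated from the purine/pyrimidine classification rather than typed in by hand. The sketch below is illustrative only: it assumes the costs 1 (transition) and 5 (transversion) used in the matrix shown earlier, and the function names are hypothetical.

```python
PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"C", "T"}  # single-ring bases


def substitution_cost(a, b, transition=1, transversion=5):
    """Distance-style cost of replacing base a with base b."""
    if a == b:
        return 0
    # A transition keeps the ring number: both purines or both pyrimidines.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


ORDER = "ATCG"  # row and column order used in the matrices above
matrix = [[substitution_cost(a, b) for b in ORDER] for a in ORDER]


def alignment_distance(seq1, seq2):
    """Sum the substitution costs over an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(substitution_cost(a, b) for a, b in zip(seq1, seq2))


for base, row in zip(ORDER, matrix):
    print(base, row)
print(alignment_distance("ACGT", "GCTT"))
```

In the usage line, the first position is a transition (A->G, cost 1) and the third is a transversion (G->T, cost 5), so the two differences are weighted unequally.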
Protein scoring
1. Identity matrix
2. Genetic Code Matrix
   o Score based on minimum number of base changes required to convert one amino acid into another.
   o Distance matrix
3. Physical/chemical characteristics
   o Attempt to quantify some physical or chemical attribute of the residues and arbitrarily assign weights based on similarities of the residues in this chosen property.
   o Hydrophobicity matrix
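As an illustration of the genetic code matrix, the minimum number of base changes between two amino acids can be computed directly from a codon table. The sketch below assumes the standard genetic code, written as a 64-character string in T, C, A, G codon order; the helper names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons run TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to the list of codons that encode it.
codons_for = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    if aa != "*":
        codons_for.setdefault(aa, []).append("".join(codon))


def min_base_changes(aa1, aa2):
    """Smallest number of differing bases over all codon pairs for the two amino acids."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for[aa1]
        for c2 in codons_for[aa2]
    )


print(min_base_changes("F", "L"))  # Phe vs. Leu
print(min_base_changes("M", "W"))  # Met (ATG) vs. Trp (TGG)
```

Because the result counts changes, it is a distance matrix: identical residues score 0 and no pair can exceed 3.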
Log odds matrices
S is the log odds ratio of two probabilities: the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance.
$$S_{ij} = \log \frac{q_{ij}}{f_i f_j}$$
∙ $q_{ij}$ are the frequencies with which residues i and j are observed to align in sequences known to be related. They are derived from a "transition probability matrix."
∙ $f_i$ and $f_j$ are the frequencies of occurrence of residues i and j in the set of sequences.
∙ Examples: PAM250, BLOSUM62, and others.
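The log odds calculation can be sketched numerically. This is a toy illustration, assuming a hypothetical two-letter alphabet with made-up frequencies; the names q (aligned-pair frequencies in related sequences) and f (background residue frequencies), and the choice of base 2, are assumptions for the example.

```python
import math


def log_odds(q, f, base=2.0):
    """Log of (observed aligned-pair frequency) / (frequency expected by chance)."""
    return {(a, b): math.log(q_ab / (f[a] * f[b]), base)
            for (a, b), q_ab in q.items()}


# Made-up data: residues X and Y are equally common in the sequence set ...
f = {"X": 0.5, "Y": 0.5}
# ... but in related sequences, identical residues pair more often than chance predicts.
q = {("X", "X"): 0.4, ("X", "Y"): 0.1, ("Y", "X"): 0.1, ("Y", "Y"): 0.4}

S = log_odds(q, f)
for pair, score in S.items():
    print(pair, round(score, 3))
```

Pairs seen more often than chance predicts receive positive scores, and pairs seen less often receive negative scores, which is why summing such scores along an alignment favors alignments that reflect descent.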
PAM Matrix
Summary of steps:
1. Align sequences that are at least 85% identical, in order to:
   o minimize ambiguity in alignments.
   o minimize the number of coincident mutations.
2. Reconstruct phylogenetic trees and infer ancestral sequences. 71 trees containing 1,572 exchanges were used.
3. Tally replacements "accepted" by natural selection, in all pair-wise comparisons (each $A_{ij}$ is the number of times amino acid j was replaced by amino acid i in all comparisons).
4. Compute amino acid mutability, $m_j$, i.e., the propensity of a given amino acid, j, to be replaced.
5. Combine data from 3 & 4 to produce a Mutation Probability Matrix for one PAM of evolutionary distance, according to the following formulae:
$$M_{ij} = \frac{\lambda\, m_j A_{ij}}{\sum_i A_{ij}} \quad (i \neq j), \qquad M_{jj} = 1 - \lambda\, m_j$$
   where $\lambda$ is a constant of proportionality.
6. Calculate the Log Odds Matrix for similarity scoring: divide each element of the Mutation Probability Matrix, M, by the frequency of occurrence of each residue:
$$R_{ij} = \frac{M_{ij}}{f_i}$$
   R is a Relatedness Odds Matrix, and $f_i$ is the frequency of residue i. The Log Odds Matrix, $S_{ij}$, is calculated from the relatedness odds matrix simply by taking the log of each $R_{ij}$:
$$S_{ij} = \log R_{ij}$$
7. Different protein families manifest different PAM rates.
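Steps 3 through 6 can be sketched on toy data. Everything numeric below is made up: a hypothetical three-letter alphabet, invented replacement counts and frequencies, and a simplified mutability (replacements involving a residue divided by its frequency). The sketch also assumes the standard Dayhoff form of the mutation probabilities, with the proportionality constant chosen so that one residue in 100 changes on average.

```python
import math

RESIDUES = ["A", "B", "C"]  # hypothetical three-letter alphabet
# A[i][j]: number of times residue j was replaced by residue i (toy, symmetric counts).
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]
f = [0.5, 0.3, 0.2]  # toy residue frequencies (sum to 1)

n = len(RESIDUES)
col_total = [sum(A[i][j] for i in range(n)) for j in range(n)]
# Simplified relative mutability: replacements involving j per unit of occurrence.
m = [col_total[j] / f[j] for j in range(n)]
# Proportionality constant giving 1 expected change per 100 residues (1 PAM).
lam = 0.01 / sum(f[j] * m[j] for j in range(n))

# Mutation probability matrix: column j holds the fates of residue j.
M = [[0.0] * n for _ in range(n)]
for j in range(n):
    for i in range(n):
        if i == j:
            M[i][j] = 1.0 - lam * m[j]
        else:
            M[i][j] = lam * m[j] * A[i][j] / col_total[j]

# Relatedness odds and log odds (base 10 chosen for the example).
R = [[M[i][j] / f[i] for j in range(n)] for i in range(n)]
S = [[math.log10(R[i][j]) for j in range(n)] for i in range(n)]

for name, row in zip(RESIDUES, S):
    print(name, [round(x, 2) for x in row])
```

With these definitions each column of M sums to one, and the log odds matrix comes out symmetric, with positive scores on the diagonal and negative scores off it at this short evolutionary distance.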
Properties of Mutation Probability Matrix
1. The sum of $M_{ij}$ for any column, j, is one (trivial). Note that the probability that an amino acid will change is on the order of 1% for each amino acid. The probability that it will stay the same is on the order of 99% for each amino acid.
2. The Mutation Probability Matrix, M1, defines a unit of evolutionary change: specifically, 1 PAM (Accepted Point Mutation per 100 residues).
   o the matrix can be used to simulate evolution by using a random number generator to select the fate of each residue in the sequence according to the probability given in the table.
   o exposing a 100 residue protein sequence of average composition to the evolutionary change represented by M1 results in one amino acid change, on average.
3. Successive application of M1 on a sequence yields 2, 3, 4... PAMs of evolutionary change.
4. The following operations are equivalent:
   o successive application of M1 on a sequence.
   o matrix multiplication of M1 by itself, M1*M1, followed by operation on a sequence.
   o scaling the elements of M1 by a constant of proportionality, $\lambda$ = 1, 2, 3... according to the formulae below, followed by operation on a sequence (this agrees with repeated multiplication only while $\lambda m_j$ remains small):
$$M_{ij} = \frac{\lambda\, m_j A_{ij}}{\sum_i A_{ij}} \quad (i \neq j), \qquad M_{jj} = 1 - \lambda\, m_j$$
   The above equation enables the direct calculation of a matrix for any desired PAM distance.
5. The matrix has compositional information in it, since it depends on the relative frequencies of amino acids in the pool of sequences from which the tallies were drawn. In the extremes, the following obtain:
   o The elements of the 0 PAM matrix are 1 for $i = j$ and 0 for $i \neq j$;
   o The elements of the $\infty$ PAM matrix approach the asymptotic amino acid composition.
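Several of these properties can be checked on a toy example. The sketch below assumes a hypothetical two-letter alphabet and a made-up column-stochastic matrix M1; the function names are illustrative, and the matrix power is computed by plain repeated multiplication.

```python
import random


def matmul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(M, n):
    """Apply M to itself n times; n = 0 gives the identity (the 0 PAM matrix)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(M, result)
    return result


def evolve(sequence, M, alphabet, rng):
    """Simulate one application of M: column j gives the fate of residue j."""
    letters = list(alphabet)
    out = []
    for residue in sequence:
        j = letters.index(residue)
        column = [M[i][j] for i in range(len(letters))]
        out.append(rng.choices(letters, weights=column)[0])
    return "".join(out)


# Made-up matrix for alphabet "XY"; each column sums to one.
M1 = [[0.99, 0.02],
      [0.01, 0.98]]

M250 = matrix_power(M1, 250)
far = matrix_power(M1, 2000)
print([round(x, 4) for x in M250[0]])
print([round(x, 4) for x in far[0]])  # both columns approach the same composition
print(evolve("XXXXYYYY", M250, "XY", random.Random(0)))
```

For this toy matrix the limiting composition is two thirds X and one third Y, regardless of the starting residue, which illustrates the compositional information carried by the matrix.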
Assumptions in PAM model:
1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).
2. Sequences that are being compared have average amino acid composition.
Sources of error in PAM model
1. Many sequences depart from average composition.
2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed!).
3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM.
4. The Markov process is an imperfect representation of evolution: distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.
BLOSUM (Blocks Substitution Matrix) Matrix
Steven Henikoff and Jorja G. Henikoff (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915-10919.
1. The starting data are conserved blocks from the Blocks database.
   o aligned, ungapped sequences
   o widely varying similarity, but measures are taken to avoid biasing the sample with frequently occurring highly related sequences.
2. Tallies of replacements are made by straightforward tallying of all pairs of aligned residues,
oTheobser