Weight Matrices for Sequence Similarity Scoring

Version 2.0

May 1996

David Wheeler, Ph.D.

Department of Cell Biology,

Baylor College of Medicine

Houston, Texas

E-mail: wheeler@bcm.tmc.edu

Table of Contents

1. Weight matrices for sequence similarity scoring

2. Importance of scoring matrices

3. Examples of matrices

4. Log odds matrices

5. PAM matrix construction

6. Properties of the PAM matrix

7. Assumptions in the PAM model

8. BLOSUM (Blocks Substitution Matrix) matrix

9. Practical aspects

10. Selecting an optimal PAM matrix

11. Other specialized scoring matrices

Weight Matrices for Sequence Similarity Scoring

Outline:

1. Objective: Overview of methods and theories that underlie the construction of scoring matrices.

2. Examples of weight matrices for nucleotide and amino acid scoring.

3. Transition probability matrix: PAM

o Construction

o Properties

o Sources of error

4. BLOSUM matrix

o Construction

o Sources of error

5. Practical aspects

6. Other refinements to transition probability matrices.

Reading:

∙ D. G. George, W. C. Barker and L. T. Hunt (1990). Mutation Data Matrix and Its Uses. In Methods in Enzymology, vol. 183; R. F. Doolittle, ed., pp. 333-351. Academic Press, Inc., New York.

∙ M. O. Dayhoff (1978). Atlas of Protein Sequence and Structure (Natl. Biomed. Res. Found., Washington), Vol. 5, Suppl. 3, pp. 345-352.

∙ S. F. Altschul (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219: 555-565.

∙ S. F. Altschul, M. S. Boguski, W. Gish and J. C. Wootton (1994). Issues in searching molecular sequence databases. Nature Genetics 6: 119-129.


Importance of scoring matrices

∙ Scoring matrices appear in all analyses involving sequence comparison.

∙ The choice of matrix can strongly influence the outcome of the analysis.

∙ Scoring matrices implicitly represent a particular theory of evolution.

∙ Understanding the theories underlying a given scoring matrix can aid in making the proper choice.

Similarity vs. Distance

1. Elements of the matrices specify the weight to assign a given comparison by:

o the cost of replacing one residue with another (distance); or

o a measure of the similarity for the replacement.

2. Distance is more naturally used for phylogenetic tree reconstruction; similarity is used for database searching.

3. The logic of the algorithm doesn't change: maximizing a similarity is fundamentally the same as minimizing a distance.

4. Distance and similarity matrices are inter-convertible by some mathematical transformation appropriate for the given application.
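One possible transformation between the two views is sketched below in Python. It assumes the simple rule d(i, j) = max score - s(i, j), which is only one of many choices and is not prescribed by these notes; the function names are illustrative. The example uses a +5/-4 nucleotide similarity matrix.

```python
def similarity_to_distance(sim):
    """Convert a similarity matrix (dict of dicts) into a distance matrix.

    Assumed rule: d(i, j) = top - s(i, j), where top is the largest score.
    """
    top = max(score for row in sim.values() for score in row.values())
    return {a: {b: top - score for b, score in row.items()}
            for a, row in sim.items()}


def total(matrix, x, y):
    """Sum the matrix entries over the aligned positions of x and y."""
    return sum(matrix[a][b] for a, b in zip(x, y))


letters = "ATCG"
similarity = {a: {b: (5 if a == b else -4) for b in letters} for a in letters}
distance = similarity_to_distance(similarity)

x, y = "ATCG", "ATGG"
print(total(similarity, x, y), total(distance, x, y))
```

Under this rule the total distance of an ungapped alignment of length n equals n times the top score minus the total similarity, so the alignment that maximizes one minimizes the other.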


Examples of matrices

A remark on notation

When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix. That is, for a matrix $M$, $M_{1,1}$ refers to the entry at the first row and the first column. In general, $M_{i,j}$ refers to the entry at the ith row and the jth column. To use this for sequence alignment, we simply associate a numeric value with each letter in the alphabet of the sequence. For example, if the alphabet is $\{A, C, G, T\}$, then A = 1, C = 2, etc. Thus, one would find the score for a match between A and C at $M_{1,2}$. Since we consider different scoring matrices in this section, we distinguish between them by using different letters for the matrix: one letter refers to the replacement matrix, $S$ to the log odds matrix, and so on.
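The indexing convention can be illustrated with a short Python sketch. The alphabet ordering, the name `entry`, and the matrix values are assumptions made for illustration only.

```python
ALPHABET = "ACGT"  # assumed ordering, so that A=1, C=2, G=3, T=4

# 1-based index for each letter, following the convention in the text.
INDEX = {letter: pos for pos, letter in enumerate(ALPHABET, start=1)}

# A hypothetical 4x4 scoring matrix stored as a list of rows.
M = [
    [5, -4, -4, -4],
    [-4, 5, -4, -4],
    [-4, -4, 5, -4],
    [-4, -4, -4, 5],
]


def entry(matrix, a, b):
    """Return the entry at row INDEX[a], column INDEX[b] (1-based indices)."""
    i, j = INDEX[a], INDEX[b]
    return matrix[i - 1][j - 1]  # Python lists are 0-based


print(INDEX["A"], INDEX["C"], entry(M, "A", "C"))
```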

Nucleotide scoring

1. Identity matrix (similarity)

       A   T   C   G
   A   1   0   0   0
   T   0   1   0   0
   C   0   0   1   0
   G   0   0   0   1

For elements in row i by column j: the entry is 1 if $i = j$ and 0 if $i \neq j$.

2. BLAST matrix (similarity)

       A   T   C   G
   A   5  -4  -4  -4
   T  -4   5  -4  -4
   C  -4  -4   5  -4
   G  -4  -4  -4   5

3. Transition/Transversion Matrix

       A   T   C   G
   A   0   5   5   1
   T   5   0   1   5
   C   5   1   0   5
   G   1   5   5   0

Nucleotide bases fall into two categories depending on the ring structure of the base. Purines (adenine and guanine) are two-ring bases; pyrimidines (cytosine and thymine) are single-ring bases. Mutations in DNA are changes in which one base is replaced by another. A mutation that conserves the ring number is called a transition (e.g., A -> G or C -> T); a mutation that changes the ring number is called a transversion (e.g., A -> C or A -> T, and so on).

Although there are more ways to create a transversion (each base has two possible transversions but only one transition), the number of transitions observed to occur in nature (i.e., when comparing related DNA sequences) is much greater. Since the likelihood of transitions is greater, it is sometimes desirable to create a weight matrix that takes this propensity into account when comparing two DNA sequences.

Use of a Transition/Transversion Matrix reduces noise in comparisons of distantly related sequences.
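The transition/transversion weighting can be sketched in Python as follows. The costs 1 and 5 are taken from the matrix in this section; the function names are illustrative, and the alignment is assumed to be ungapped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a, b):
    """Classify an aligned pair of bases as match, transition or transversion."""
    if a == b:
        return "match"
    pair = {a, b}
    same_ring_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_ring_class else "transversion"


def ts_tv_cost(a, b, transition_cost=1, transversion_cost=5):
    """Distance-style weight: 0 for a match, low for transitions, high for transversions."""
    kind = substitution_type(a, b)
    if kind == "match":
        return 0
    return transition_cost if kind == "transition" else transversion_cost


def alignment_cost(x, y):
    """Total cost of an ungapped alignment of two equal-length DNA sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(ts_tv_cost(a, b) for a, b in zip(x, y))


print(alignment_cost("ATCG", "GTAG"))
```

Evaluating `ts_tv_cost` over all pairs of bases in the order A, T, C, G reproduces the rows of the matrix given in this section.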

Protein scoring

1. Identity matrix

2. Genetic Code Matrix

o Score based on minimum number of base changes required to convert one amino acid into another.

o Distance matrix

3. Physical/chemical characteristics

o Attempt to quantify some physical or chemical attribute of the residues and arbitrarily assign weights based on similarities of the residues in this chosen property.

o Hydrophobicity matrix
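A genetic code distance of the kind described in item 2 can be sketched in Python. The sketch assumes the standard genetic code, written in the usual compact form with codons enumerated in T, C, A, G order; the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons run TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (one-letter code) to the list of its codons.
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(codon))


def base_changes(codon1, codon2):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon1, codon2))


def genetic_code_distance(aa1, aa2):
    """Minimum number of base changes needed to turn aa1 into aa2."""
    return min(base_changes(c1, c2)
               for c1 in CODONS[aa1] for c2 in CODONS[aa2])


print(genetic_code_distance("M", "I"), genetic_code_distance("W", "D"))
```

For example, Met (ATG) is one base change away from Ile (ATA), while Trp (TGG) differs from both Asp codons (GAT, GAC) at all three positions.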


Log odds matrices

$S_{ij}$ is the log odds ratio of two probabilities: the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance.

$$S_{ij} = \log \frac{q_{ij}}{f_i f_j}$$

$q_{ij}$ are the frequencies with which residues i and j are observed to align in sequences known to be related. They are derived from a "transition probability matrix."

$f_i$ and $f_j$ are the frequencies of occurrence of residues i and j in the set of sequences.

∙ e.g., PAM250, BLOSUM62 and others.
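A log odds score can be computed directly from its definition. The Python sketch below writes q for the frequency with which a pair is observed aligned in related sequences and f for the background frequency of each residue (symbol names assumed here); the two-letter alphabet and all numbers are invented for illustration.

```python
import math


def log_odds(q, f, base=2.0):
    """Return S[i][j] = log(q[i][j] / (f[i] * f[j])) for every residue pair."""
    return {i: {j: math.log(q[i][j] / (f[i] * f[j]), base) for j in f}
            for i in f}


# Toy data: background frequencies and observed aligned-pair frequencies.
f = {"X": 0.5, "Y": 0.5}
q = {"X": {"X": 0.4, "Y": 0.1},
     "Y": {"X": 0.1, "Y": 0.4}}

S = log_odds(q, f)
print(S["X"]["X"], S["X"]["Y"])
```

Pairs that are aligned more often than chance predicts (here X with X) receive a positive score, and pairs aligned less often than chance (X with Y) receive a negative score.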

PAM Matrix

Summary of steps:

1. Align sequences that are at least 85% identical.

o minimize ambiguity in alignments.

o minimize the number of coincident mutations.

2. Reconstruct phylogenetic trees and infer ancestral sequences. 71 trees containing 1,572 exchanges were used.

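3. Tally replacements "accepted" by natural selection, in all pair-wise comparisons (each $A_{ij}$ is the number of times amino acid j was replaced by amino acid i in all comparisons).

4. Compute amino acid mutability, $m_j$, i.e., the propensity of a given amino acid, j, to be replaced.

5. Combine data from 3 & 4 to produce a Mutation Probability Matrix for one PAM of evolutionary distance, according to the following formulae:

$$M_{ij} = \frac{\lambda m_j A_{ij}}{\sum_i A_{ij}} \quad (i \neq j), \qquad M_{jj} = 1 - \lambda m_j$$

where $\lambda$ is a constant of proportionality chosen so that the matrix corresponds to one PAM.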

6. Calculate Log Odds Matrix for similarity scoring:

Divide each element of the Mutation Probability Matrix, M, by the frequency of occurrence of each residue:

$$R_{ij} = \frac{M_{ij}}{f_i}$$

R is a Relatedness Odds Matrix, $f_i$ is the frequency of residue i.

The Log Odds Matrix, $S_{ij}$, is calculated from the relatedness odds matrix, $R_{ij}$, simply by taking the log of each $R_{ij}$.

7. Different protein families manifest different PAM rates.
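Steps 3 to 6 can be sketched in Python on a toy three-letter alphabet. The tallies, mutabilities and frequencies below are invented, the names are illustrative, and the normalization of the proportionality constant (here `lam`) so that 1% of residues change on average is the Dayhoff convention assumed for this sketch.

```python
import math


def mutation_probability_matrix(A, m, f, pam=1.0):
    """Build M with M[i][j] = probability that residue j is replaced by i.

    A[i][j]: tally of accepted replacements of j by i (only i != j is used)
    m[j]:    relative mutability of residue j
    f[j]:    frequency of residue j
    """
    letters = list(f)
    # Scale so that the expected fraction of changed residues is pam / 100.
    lam = (pam / 100.0) / sum(f[j] * m[j] for j in letters)
    M = {i: {} for i in letters}
    for j in letters:
        column_total = sum(A[i][j] for i in letters if i != j)
        for i in letters:
            if i == j:
                M[i][j] = 1.0 - lam * m[j]
            else:
                M[i][j] = lam * m[j] * A[i][j] / column_total
    return M


def log_odds_from_M(M, f):
    """Relatedness odds R[i][j] = M[i][j] / f[i], then S = log10(R)."""
    return {i: {j: math.log10(M[i][j] / f[i]) for j in f} for i in f}


f = {"X": 0.5, "Y": 0.3, "Z": 0.2}
m = {"X": 0.8, "Y": 1.0, "Z": 1.5}
A = {"X": {"X": 0, "Y": 30, "Z": 10},
     "Y": {"X": 30, "Y": 0, "Z": 20},
     "Z": {"X": 10, "Y": 20, "Z": 0}}

M1 = mutation_probability_matrix(A, m, f)
S = log_odds_from_M(M1, f)
print(M1["X"]["X"], S["X"]["X"], S["Y"]["X"])
```

With these inputs every column of `M1` sums to one and the frequency-weighted average of the diagonal is 0.99, i.e., one change per 100 residues on average.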


Properties of Mutation Probability Matrix

1. The sum of $M_{ij}$ for any column, j, is one (trivial). Note that the probability that an amino acid will change is on the order of 1% for each amino acid. The probability that it will stay the same is on the order of 99% for each amino acid.

2. The Mutation Probability Matrix, M1, defines a unit of evolutionary change: specifically, 1 PAM (one accepted point mutation per 100 residues).

o the matrix can be used to simulate evolution by using a random number generator to select the fate of each residue in the sequence according to the probability given in the table.

o exposing a 100 residue protein sequence of average composition to the evolutionary change represented by M1 results in one amino acid change, on average.

3. Successive application of M1 on a sequence yields 2, 3, 4 ... PAMs of evolutionary change.

4. The following operations are equivalent:

o successive application of M1 on a sequence.

o matrix multiplication of M1 by itself, M1*M1, followed by operation on a sequence.

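o scaling the elements of M1 by a constant of proportionality, $\lambda$ = 1, 2, 3 ... according to the formulae below, followed by operation on a sequence:

$$M_{ij} = \frac{\lambda m_j A_{ij}}{\sum_i A_{ij}} \quad (i \neq j), \qquad M_{jj} = 1 - \lambda m_j$$

Here $A_{ij}$ is the tally of replacements of amino acid j by amino acid i, $m_j$ is the mutability of amino acid j, and $\lambda$ is counted in multiples of its 1 PAM value. The above equations enable the direct calculation of a matrix for a desired PAM distance; the scaling is a first-order approximation to repeated multiplication and holds only while $\lambda m_j$ remains small.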

5. The matrix has compositional information in it, since it depends on the relative frequencies of amino acids in the pool of sequences from which the tallies were drawn. In the extremes, the following obtain:

o The elements of the 0 PAM matrix are 1 for $i = j$ and 0 for $i \neq j$;

o The $\infty$ PAM matrix elements approach the asymptotic amino acid composition.
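Several of these properties can be checked numerically. The Python sketch below uses an invented two-letter, column-stochastic matrix standing in for M1; the function names are illustrative. It raises the matrix to a power by repeated multiplication and simulates evolution with a random number generator.

```python
import random


def mat_mul(P, Q):
    """Multiply two square matrices stored as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Apply M n times; n = 0 gives the identity (the 0 PAM matrix)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


def evolve(seq, M, alphabet, rng):
    """Replace each residue j by residue i with probability M[i][j]."""
    letters = list(alphabet)
    out = []
    for residue in seq:
        j = letters.index(residue)
        column = [M[i][j] for i in range(len(letters))]
        out.append(rng.choices(letters, weights=column, k=1)[0])
    return "".join(out)


ALPHABET = "XY"
M1 = [[0.99, 0.02],   # column j holds the fate probabilities of residue j
      [0.01, 0.98]]

M2 = mat_pow(M1, 2)          # two units of change
M_far = mat_pow(M1, 1000)    # very distant: columns approach the composition
descendant = evolve("XXYYXXYY", M1, ALPHABET, random.Random(0))
print(M2[0][0], M_far[0], descendant)
```

For this toy matrix both columns of `M_far` approach (2/3, 1/3), the equilibrium composition implied by the matrix itself, regardless of the starting residue.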


Assumptions in PAM model:

1. replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).

2. sequences that are being compared have average amino acid composition.

Sources of error in PAM model

1. Many sequences depart from average composition.

2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed!).

3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM.

4. The Markov process is an imperfect representation of evolution: Distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.


BLOSUM (Blocks Substitution Matrix) Matrix

Steven Henikoff and Jorja G. Henikoff (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915-10919.

1. Starting data is conserved blocks from Blocks database.

o aligned, ungapped sequences

o widely varying similarity, but measures are taken to avoid biasing the sample with frequently occurring highly related sequences.

2. Tallies of replacements are made by straightforward tallying of all pairs of aligned residues,

o The obser
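The tallying of all pairs of aligned residues in step 2 can be sketched in Python as follows. The block is invented, the function name is illustrative, and the sketch deliberately omits the measures mentioned in step 1 for down-weighting highly related sequences.

```python
from collections import Counter
from itertools import combinations


def tally_block_pairs(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    if len({len(row) for row in block}) != 1:
        raise ValueError("rows of an ungapped block must have equal length")
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Toy block: three aligned, ungapped segments of width three.
block = ["AAC",
         "AAC",
         "ATC"]

counts = tally_block_pairs(block)
print(counts)
```

A block of s sequences and width w contributes w * s * (s - 1) / 2 pairs in total; here that is 3 * 3 = 9.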
