Protein Families and Domains

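1 Protein family and domain databases

1.1 Protein motif and domain databases

Motifs and domains

PROSITE database

PRINTS database

BLOCKS database

ProDom database

Pfam database

SMART database

InterPro database

Conserved Domain Database

CDART

Motifs and domains: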

Biologists can gain insight into protein function by identifying short consensus sequences related to known functions. These consensus sequence patterns are termed motifs and domains.

A motif is a short conserved sequence pattern associated with distinct functions of a protein or DNA.

It is often associated with a distinct structural site performing a particular function.

A typical motif, such as a Zn-finger motif, is ten to twenty amino acids long.

A domain is also a conserved sequence pattern, defined as an independent functional and structural unit.

Domains are normally longer than motifs.

A domain consists of more than 40 residues and up to 700 residues, with an average length of 100 residues.

A domain may or may not include motifs within its boundaries.

Examples: transmembrane domains, ligand-binding domains.

Identification of motifs and domains relies heavily on multiple sequence alignment as well as profile and hidden Markov model (HMM) construction.

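PROSITE (protein family and domain database):

The first established sequence pattern database: www.expasy.org/prosite/

PROSITE is a database of protein families and domains. It contains biologically significant sites, patterns, and statistical profiles that help identify the protein family to which a sequence belongs.

The sequence patterns in PROSITE include enzyme catalytic sites, ligand-binding sites, metal-ion-binding residues, cysteines involved in disulfide bonds, and regions that bind small molecules or other proteins.

PROSITE also includes statistical profiles built from multiple sequence alignments, which can detect more sensitively whether an (unknown) sequence carries the corresponding feature.

The functional information of these patterns is primarily based on published literature.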
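To make the idea of a sequence pattern concrete, the sketch below translates a PROSITE-style pattern string into a Python regular expression and scans a sequence with it. The function name prosite_to_regex and the toy sequence are assumptions made for illustration. The C2H2 zinc-finger-style pattern is quoted from memory and should be checked against the current PROSITE entry before use. Only the common syntax elements are handled (x, [..], {..}, repeat counts, and the < and > anchors).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' pins the match to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        residues, low, high = m.groups()
        if residues == "x":
            core = "."                            # any residue
        elif residues.startswith("{") and residues.endswith("}"):
            core = "[^" + residues[1:-1] + "]"    # any residue except these
        else:
            core = residues                       # one residue or a [..] set
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

# Illustrative pattern (assumed, not taken from the notes above).
ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"

# Toy sequence built so that it contains exactly one match.
toy = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"

regex = re.compile(prosite_to_regex(ZINC_FINGER))
for hit in regex.finditer(toy):
    print(hit.start(), hit.end(), hit.group())
```

Note that re.finditer reports non-overlapping matches only, so a dedicated scanner may report more hits than this sketch on sequences with overlapping sites.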

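PRINTS (protein motif fingerprint database):

A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SWISS-PROT/TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D space. http://bioinf.man.ac.uk/dbbrowser/PRINTS/

It offers protein homology analysis, motif fingerprint analysis, phylogenetic and sequence evolution analysis, and microarray analysis, together with downloads of bioinformatics resources and PRINTS data.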

BLOCKS:

A database of blocks

Blocks:

Ungapped multiple alignments derived from the most conserved, ungapped regions of homologous protein sequences.

The blocks, which are usually longer than motifs, are subsequently converted to PSSMs.

Because blocks often encompass motifs, the functional annotation of blocks is consistent with that of the motifs.

http://blocks.fhcrc.org/blocks

Tools for detecting and identifying protein motifs include BLOCK search, Get Blocks, and Block Maker.

A query sequence can be aligned against the precomputed profiles in the database to select the highest-scoring matches.
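The conversion of a block into a PSSM, and the scoring of a query against it, can be sketched as follows. This is a minimal illustration and not the procedure used by the BLOCKS server: it assumes a uniform background frequency and a flat pseudocount, whereas real tools use sequence weighting and substitution-matrix-based pseudocounts. The function names block_to_pssm and best_hit and the toy block are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def block_to_pssm(block, pseudocount=1.0):
    """Convert an ungapped block (equal-length sequences) into a log-odds PSSM."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in a block must have the same length")
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(seq[col] for seq in block)
        # One log2-odds score per amino acid for this column.
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm

def best_hit(pssm, query):
    """Slide the PSSM along the query without gaps; return (score, offset)."""
    width = len(pssm)
    best_score, best_offset = float("-inf"), -1
    for start in range(len(query) - width + 1):
        # Residues outside the 20 standard amino acids score 0.
        score = sum(pssm[i].get(query[start + i], 0.0) for i in range(width))
        if score > best_score:
            best_score, best_offset = score, start
    return best_score, best_offset

# Toy block of three aligned four-residue segments (made-up data).
toy_block = ["HCAK", "HCGK", "HCAR"]
toy_pssm = block_to_pssm(toy_block)
print(best_hit(toy_pssm, "MMHCAKLL"))
```

Searching a database of blocks then amounts to running best_hit against every stored PSSM and ranking the resulting scores.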

ProDom

Domain database

ProDom is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases.

The domains are built using recursive iterations of PSI-BLAST.

http://prodom.prabi.fr/prodom/current/html/home.php

It provides similarity searches and multiple sequence alignments of related domains from SWISS-PROT.

Pfam (Protein families database of alignments and HMMs)

A database of protein domains derived from sequences in SWISS-PROT and TrEMBL. Each motif or domain is represented by an HMM profile generated from the seed alignment of a number of conserved homologous proteins. http://pfam.janelia.org/

The Pfam database is composed of two parts:

Pfam-A involves manual alignments.

Pfam-B uses automatic alignment in a way similar to ProDom (PSI-BLAST).

The functional annotation of motifs in Pfam-A is often related to that in PROSITE. Pfam-B only contains sequence families not covered in Pfam-A.

Because of its automatic nature, Pfam-B has much larger coverage but is also more error prone, because some HMMs are generated from unrelated sequences.

SMART (Simple Modular Architecture Research Tool):

Contains HMM profiles constructed from manually refined protein domain alignments. http://smart.embl-heidelberg.de/

Alignments in the database are built based on

tertiary structures whenever available

or based on PSI-BLAST profiles.

Alignments are further checked and refined by human annotators before HMM profile construction.

Protein functions are also manually curated.

The database may be of better quality than Pfam, with more extensive functional annotations.

Compared to Pfam, the SMART database contains an independent collection of HMMs, with emphasis on signaling, extracellular, and chromatin-associated motifs and domains.

Sequence searching in this database produces a graphical output of domains with well-annotated information with respect to cellular localization, functional sites, superfamily, and tertiary structure.

InterPro:

An integrated pattern database: www.ebi.ac.uk/interpro/

The database integrates information from the PROSITE, Pfam, PRINTS, ProDom, and SMART databases.

The sequence patterns from the five databases are further processed. Only overlapping motifs and domains in a protein sequence derived by all five databases are included.

A popular feature of this database is a graphical output that summarizes motif matches and has links to more detailed information.

CDD (Conserved Domain Database)

A collection of multiple sequence alignments for ancient domains and full-length proteins. http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

The CD-Search service may be used to identify the conserved domains present in a protein query sequence:

http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

RPS-BLAST (Reverse PSI-BLAST) is the search tool used in the CD-Search service.

It uses a query sequence to search against a pre-computed profile database generated by PSI-BLAST. The role of the PSSM has changed from "query" to "subject", hence the term "reverse" in RPS-BLAST.

It performs only one iteration of regular BLAST searching against a database of PSI-BLAST profiles to find the high-scoring gapped matches.

CDART (Conserved Domain Architecture Retrieval Tool):

A domain search program: www.ncbi.nlm.nih.gov/BLAST/

Combines the results from RPS-BLAST, SMART, and Pfam.

The resulting domain architecture of a query sequence can be graphically presented along with related sequences.

CDART is not a substitute for individual database searches because it often misses certain features that can be found in SMART and Pfam.

1.2 Protein family databases

COG (Clusters of Orthologous Groups):

A protein family database based on phylogenetic classification. www.ncbi.nlm.nih.gov/COG/

It is constructed by comparing protein sequences encoded in completely sequenced genomes.

Unicellular clusters:

The search tool is the COGnitor program.

Eukaryotic clusters:

The search tool is KOGnitor.

A query sequence can be assigned a function if it has significant similarity matches with any member of the cluster.

ProtoNet:

A database of clusters of homologous proteins similar to COG. www.protonet.cs.huji.ac.il/

Orthologous protein sequences in the SWISS-PROT database are clustered based on pairwise sequence comparisons between all possible protein pairs using BLAST.

Protein relatedness is defined by the E-values from the BLAST alignments.

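As a rough illustration of clustering proteins from pairwise E-values, the sketch below links any two proteins whose E-value falls at or below a cutoff and reports the connected groups (flat single-linkage clustering with a union-find structure). This is a deliberate simplification and not ProtoNet's actual procedure, which the text above does not describe in detail. The function name cluster_by_evalue, the 1e-5 cutoff, and the protein identifiers are all assumptions.

```python
def cluster_by_evalue(pairs, threshold=1e-5):
    """Group proteins, linking any pair whose E-value is <= threshold.

    pairs: iterable of (protein_a, protein_b, evalue) tuples.
    Returns a sorted list of sorted member lists.
    """
    parent = {}

    def find(x):
        # Register unseen proteins as their own cluster root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b, evalue in pairs:
        root_a, root_b = find(a), find(b)
        if evalue <= threshold and root_a != root_b:
            parent[root_b] = root_a         # merge the two clusters

    clusters = {}
    for protein in parent:
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in clusters.values())

# Made-up pairwise results for five hypothetical proteins.
toy_pairs = [
    ("P1", "P2", 1e-30),
    ("P2", "P3", 1e-8),
    ("P3", "P4", 0.5),     # too weak to link
    ("P4", "P5", 1e-12),
]
print(cluster_by_evalue(toy_pairs))
```

Single linkage is sensitive to chaining through multidomain proteins, which is one reason production systems use more careful merging rules.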

A query protein sequence can be submitted to the server for cluster identification and functional annotation.

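1.3 Protein structure databases

PDB (Protein Data Bank)

PDB contains three-dimensional structures of biological macromolecules determined experimentally (X-ray crystallography, nuclear magnetic resonance (NMR)):

Proteins

Nucleic acids

Carbohydrates

Other complexes

http://www.rcsb.org/pdb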

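SCOP (Structural Classification of Proteins), a protein structure classification database

It provides a detailed description of the structural and evolutionary relationships between proteins of known structure, covering all entries in the Protein Data Bank (PDB).

http://scop.mrc-lmb.cam.ac.uk/scop/

Besides structural and evolutionary relationships, SCOP also includes the following information for each protein:

links to PDB, sequences, references, structure images, and so on.

Proteins are classified by structural and evolutionary relationship; the result is a hierarchical tree whose main levels are family, superfamily, and fold:

Family:

a clear evolutionary relationship

Superfamily:

a distant evolutionary relationship, with a common evolutionary origin

Fold:

major structural similarity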

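DSSP (database of protein secondary structure)

For any protein in the PDB macromolecular database, DSSP derives the corresponding secondary structure from its three-dimensional structure.

http://www.sander.embl-heidelberg.de/dssp/

It is very useful for studying the relationship between protein sequence and secondary and spatial structure.

Besides secondary structure, DSSP also includes the geometric features and solvent accessibility of the protein.

HSSP (database of homologous protein sequence alignments)

A secondary database. http://www.sander.embl-heidelberg.de/hssp/

The data come from PDB or from SWISS-PROT.

For each protein in PDB, HSSP aligns all sequences homologous to it, thereby grouping proteins of similar sequence into structurally homologous families.

HSSP helps in analyzing conserved regions of proteins, studying their evolutionary relationships, and designing protein molecules.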

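1.4 Other biological macromolecule databases

MMDB (Molecular Modeling Database)

MMDB is part of (NCBI) Entrez; it contains experimentally determined structures of biological macromolecules.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure

Compared with PDB, MMDB carries much additional information for each macromolecular structure, such as the molecule's biological function, the mechanism behind that function, and its evolutionary history.

It also provides tools for displaying three-dimensional structure models, structure analysis, and structure comparison.

dbSNP (single nucleotide polymorphism database)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp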

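OMIM (Online Mendelian Inheritance in Man)

A catalog of human genes and genetic disorders.

The database collects known human genes and the genetic diseases caused by mutation or deletion of those genes.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

EPD

Eukaryotic Promoter Database http://www.epd.isb-sib.ch/

It provides eukaryotic gene promoter sequences obtained from EMBL, with the aim of helping experimental and bioinformatics researchers analyze the transcription signals of eukaryotic genes.

TRRD (Transcription Regulatory Regions Database)

An integrated database of gene regulation information.

The database collects information on the structure and function of the transcription regulatory regions of eukaryotic genes.

Each TRRD entry corresponds to one gene and contains the various structure-function features of that gene. http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/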

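2 Protein function prediction

Protein structure and function have been studied for a long time; because of their complexity, predicting them remains difficult in both methodology and underlying theory.

The general procedure for protein function prediction:

Database homology search: predict function from homology information

Is the sequence (structure) of the unknown protein similar to the sequence (structure) of a protein of known function?

Predict function from sequence features

Many protein properties can be derived directly from the sequence, such as hydrophobicity, which can be used to predict whether a segment is a transmembrane helix or a leader sequence.

Motif or domain search: determine function by comparison against motif or domain databases

If the unknown protein contains a conserved motif or domain, it is likely to have the function associated with that motif or domain.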

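2.1 Predicting function from homology information

Similar sequence → homology → similar function

Database similarity searching is the most reliable way to determine protein function.

A significant match should have at least 25% sequence identity over a segment of more than 80 amino acids.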
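The 25% identity over more than 80 residues rule of thumb can be checked on a pairwise alignment as sketched below. The function names and the choice to count identity over gap-free columns only are assumptions; published definitions of percent identity differ in their denominator, so results near the threshold depend on that choice.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (percent identity, number of gap-free columns) for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap character.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0, 0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns), len(columns)

def is_significant(aligned_a, aligned_b, min_identity=25.0, min_length=80):
    """Apply the rule of thumb: at least 25% identity over more than 80 residues."""
    identity, length = percent_identity(aligned_a, aligned_b)
    return identity >= min_identity and length > min_length

# Made-up 90-residue example: the first 30 positions are identical.
seq_a = "ACDEFGHIKL" * 9
seq_b = seq_a[:30] + "W" * 60
print(percent_identity(seq_a, seq_b), is_significant(seq_a, seq_b))
```

A short but perfect match fails the test on length, which is the point of combining the two criteria.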

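The general strategy is:

first run a BLAST search;

if it gives no relevant result, run FASTA;

if FASTA also gives no clue to the protein's function, finally use a search program based entirely on the Smith-Waterman algorithm.

BLITZ:

www.ebi.ac.uk/searches/blitz.html

ParAlign:

http://www.paralign.org/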
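For reference, the Smith-Waterman local alignment algorithm that such programs are built on can be sketched in a few lines. This minimal version assumes a simple match/mismatch score and a linear gap penalty. Real search programs use substitution matrices such as BLOSUM62, affine gap penalties, and heavy optimization, so this shows only the recurrence and traceback, not how BLITZ or ParAlign are implemented.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores are floored at 0 so alignments can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Toy sequences sharing one conserved segment (made-up data).
print(smith_waterman("WWWWACDEFGHWWWW", "KKACDEFGHKK"))
```

The full dynamic programming matrix costs time and memory proportional to the product of the two sequence lengths, which is why the heuristic programs are tried first.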

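2.2 Predicting function from sequence features

Check whether the unknown protein contains subsequences or conserved segments associated with a particular protein family or functional domain

Transmembrane

Signal peptide

Domain & Motif

Coiled coil

Subcellular location

Secondary structure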

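Hydrophobicity analysis

ProtScale:

http://www.expasy.org/cgi-bin/protscale.pl

It can compute more than 50 protein properties.

The sequence can be pasted into the input box, or a SWISS-PROT accession number can be entered.

Set the width parameter in the input box; this parameter tells the system each …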
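A sliding-window hydropathy profile of the kind used in such analyses can be sketched as follows. The scale is the Kyte-Doolittle hydropathy index. The function names, the reading of the width parameter as a sliding-window size, and the window of 19 with a cutoff of 1.6 (a commonly quoted rule of thumb for transmembrane helices) are assumptions for illustration and are not taken from ProtScale.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=9):
    """Average hydropathy of every window of the given width along the sequence."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def hydrophobic_windows(sequence, window=19, cutoff=1.6):
    """Start offsets of windows whose mean hydropathy exceeds the cutoff."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, mean in enumerate(profile) if mean > cutoff]

# Made-up sequence: a 20-residue leucine stretch flanked by charged residues.
toy = "DDDD" + "L" * 20 + "KKKK"
print(hydrophobic_windows(toy))
```

A wider window smooths the profile and suits long features such as membrane-spanning helices, while a narrow window highlights short surface-exposed or buried stretches.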
