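Summary of Genetic Polymorphism Knowledge

I. SNP, LD, Haplotype and Tagger SNP

1. Genetic/gene polymorphism

In a randomly mating population, a genetic (gene) polymorphism exists when two or more genotypes are found at the same locus on a chromosome and each allele occurs in the population at a frequency above 1%. It is an important factor in human susceptibility to disease, in the diversity of clinical manifestations, and in differences in response to drug treatment. A base variation whose population frequency is 1% or less is called a mutation.

Each base type found at the same DNA position on a chromosome is called an allele. For example, some people carry an A at a given chromosomal position while others carry a G at the same position. Apart from the sex chromosomes, every person has two copies of each chromosome, so the pair of alleles a person carries is called the genotype, for example GA, GG, or AA; determining a person's genotype is called genotyping. The observable physical or physiological traits of an organism (a human) that result from different genotypes acting together with the environment are called the phenotype.

Restriction fragment length polymorphism (RFLP) is the first-generation genetic marker. Variable number of tandem repeat (VNTR) is the second-generation genetic marker; repeats with a unit of 2-6 nucleotides are called microsatellites or short tandem repeats, and repeats with a unit of 6-12 nucleotides are called minisatellites.

Polymorphisms are defined as frequent (occurring in greater than 1% of the population) variations in the human DNA sequence. Most involve a single base pair substitution, known as single nucleotide polymorphisms (1), although more complex variations are also recognised. SNPs are single base pair positions in genomic DNA at which different sequence alternatives (alleles) exist in normal individuals in some population(s), wherein the least frequent allele has an abundance of 1% or greater. In principle, SNPs could be bi-, tri-, or tetra-allelic polymorphisms. However, in humans, tri-allelic and tetra-allelic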
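SNPs are rare almost to the point of non-existence, and so SNPs are sometimes simply referred to as bi-allelic markers.

Single nucleotide polymorphism (SNP): the term was first proposed in 1996 by Lander of the human genome research center at the Massachusetts Institute of Technology. A SNP is a single-base difference at a specific nucleotide position in the genomic DNA sequence of different individuals, and it is the third-generation genetic marker. Any SNP should occur in the population at a frequency of no less than 1%. In principle, SNPs can be bi-, tri-, or tetra-allelic polymorphisms, but in humans tri- and tetra-allelic SNPs are rare or almost nonexistent, so "SNP" usually just means a bi-allelic marker.

Bi-allelic SNP substitutions comprise one type of transition, C↔T (G↔A), and three types of transversion, C↔A (G↔T), C↔G (G↔C), and T↔A (A↔T). Because deamination of 5-methylcytosine is relatively frequent, the four types of SNPs occur at different frequencies in the genome: about 2/3 of SNPs in organisms are C/T (G/A) transitions, and most lie in non-transcribed sequences. It is estimated that the 3 × 10^9 bases of the human genome contain at least 10 million SNP sites, on average about 1 SNP per 1000 bp.

The main difference from other genetic markers (such as restriction fragment length polymorphisms and short tandem repeats) is that SNPs are not detected through differences in "length"; the sequence variation itself serves as the marker. This gives SNPs the distinctive advantages of high abundance, high stability, and ease of automated analysis.

English description: SNP markers are preferred over microsatellite markers for association studies, because of their high abundance along the human genome (SNPs with minor allele frequency >0.1 occur once every 600 bp) (Wang et al. 1998), their low mutation rate, and the accessibility of high-throughput genotyping. The power of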
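association studies based on SNPs depends not only on the sample size and density of the marker map but also on many other factors, such as the age and frequency of the disease mutations and SNPs and the extent of linkage disequilibrium (LD) in the region (2).

SNP sites can be divided into several broad classes according to their position in the gene sequence. Most SNPs have no effect on gene function and are called anonymous SNPs. SNP sites located within genes are called gene-based SNPs and include SNPs in introns, exons, and promoters. Among these, SNP sites in protein-coding sequences are called cSNPs or coding SNPs. A cSNP that does not change the encoded amino acid sequence is called a synonymous SNP; a cSNP that changes the amino acid sequence is called a non-synonymous SNP.

A SNP in the protein-coding region of a gene may cause an amino acid substitution and so alter protein function. Most SNPs, however, occur in non-coding regions. SNPs in the promoter region may affect transcription factor binding and change the rate or level of gene transcription; SNPs in the 5' upstream or 3' downstream regions may alter the stability of the transcribed mRNA or enhancer activity; the functional effects of SNPs in intronic regions remain to be studied further (3).

There are many methods for detecting SNPs, including direct sequencing, PCR-RFLP, single strand conformation polymorphism analysis (SSCP), heteroduplex analysis (HA), denaturing gradient gel electrophoresis (DGGE), the solid phase chemical cleavage method (spCCM), allele-specific PCR, DNA chip detection, and real-time fluorescent quantitative PCR. All have relatively high specificity and sensitivity, and each laboratory can choose a suitable method according to its research aims and budget.

2. Haplotype

A combination of SNPs that lie in a specific region of a chromosome, are associated with one another, and tend to be passed to offspring as a unit is called a haplotype. Haplotypes have been likened to "molecular fossils" of human evolutionary history. If a stretch of DNA contains n SNP sites, up to 2^n haplotypes may in theory exist in the population, but each individual carries only 2 haplotypes.

Haplotype construction methods: experimental methods currently include single-molecule dilution, AS-PCR (allele-specific PCR), long-insert cloning, and diploid-to-haploid conversion; statistical algorithms include Clark's algorithm, maximum-likelihood algorithms, and Bayesian algorithms.

3. Haplotype block

Based on the linkage disequilibrium among SNPs over large ranges of the genome, the haplotype structure of the human genome can be described with a relatively simple model: chromosomes contain continuous, stable haplotype regions that have almost never been broken by recombination, called haplotype blocks (or haploblocks). Several neighboring, tightly linked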
SNPs are inherited together and form a haplotype block. A haplotype block may be the smallest unit of inheritance. In extreme cases it can be a single SNP or an entire chromosome, and regions of frequent recombination separate adjacent haplotype blocks.

3.1 Definitions of a haplotype block

A haplotype block is a contiguous set of markers in which the average D′ (the standardized coefficient of LD) (4) is greater than some predetermined threshold.

Gabriel et al. (5) described how the human genome can be parsed objectively into haplotype blocks: sizable regions over which there is little evidence for historical recombination and within which only a few common haplotypes are observed. This definition is based on linkage disequilibrium (LD), that is, on large pairwise |D′| values between the SNP pairs within one haploblock.

Patil et al. (6) defined a haplotype block as a region with a large proportion (80%) of inferred common haplotypes. This definition is based on the concept of "chromosome coverage", with a haplotype block containing a minimum number of SNPs that account for a majority of common haplotypes or a reduced level of haplotype diversity.

Wang et al. (7) further proposed explicit "no historical recombination" as a definition for haplotype blocks, which can be tested using a four-gamete test.

Ding K et al. (8) chose to define haplotype blocks based on LD when haplotype-block-based tSNP selection methods were employed. The LD-based haplotype-block definition requires that SNP pairs with strong D′ (absolute D′ ≥ 0.70) account for at least 95% of all pairs of SNPs.

3.2 Haplotype block algorithms and partitioning criteria

3.2.1 Based on linkage disequilibrium

Gabriel criteria (5) for haplotype block partitioning:
- SNPs with MAF below 0.05 are excluded.
- "Strong LD": a SNP pair is in strong LD if the one-sided upper 95% confidence bound on D′ is >0.98 (that is, consistent with no historical recombination) and the lower bound is above 0.7.
- "Strong evidence for historical recombination": pairs for which the upper confidence bound on D′ is less than 0.9.
- A haplotype block is defined as a region over which a very small proportion (<5%) of comparisons among informative SNP pairs show strong evidence of historical recombination.

3.2.2 Based on haplotype diversity (chromosome coverage)

Patil et al. (6) defined a haplotype block as a region with a large proportion (80%) of inferred common haplotypes, and proposed a greedy algorithm that gives an approximate block partition. The algorithm first considers all possible blocks formed by consecutive SNPs. It then selects the block that maximizes the ratio of the number of SNPs in the block to the minimum number of tag SNPs needed to distinguish the haplotypes that occur more than once; in other words, it covers the most SNPs with the fewest tag SNPs. Every SNP is assigned to a block. The size of a block does not depend on its order along the chromosome, and blocks have no absolute boundaries.

Two criteria: (1) in each block, at least 80% of the observed haplotypes are represented more than once; and (2) the total number of tag SNPs for distinguishing at least 80% of haplotypes is as small as possible.

Zhang et al. (10-11) proposed a dynamic programming algorithm for haplotype block partitioning. Its principle is to minimize the number of tag SNPs that can represent most of the properties of each block. Their algorithm has been developed into the program HAPBLOCK (http://hto-b.usc.edu/msms/HapBlock/).

Although each of the above methods has its merits, Wall et al. (12) favored the first class of methods, for three reasons. First, using D′ to detect historical recombination directly seems closer to the definition of a haplotype block. Second, pairwise methods are easier to apply to diploid genotype data. Finally, pairwise LD coefficients are easier to visualize.

3.2.3 Other partitioning criteria

- Haplotype block boundaries were inferred from the phased genotype data (probability threshold for correct phase call at each site: 0.95) by D′ confidence limits (upper confidence limit 0.97, lower confidence limit 0.70, fraction of informative pairs in strong LD: 0.95) using Haploview (http://www.broad.mit.edu/personal/jcbarret/haploview/).
- The minimum D′ between all pairs of SNPs is 0.9 (13-14).
- Both r² and D′ equal 1 between all pairs of SNPs (15).
- The minimum r² between all pairs of SNPs is 0.8 (16).
- The minimum D′ for 95% of SNP pairs is 0.7 (8).

Several neighboring, tightly linked SNPs are inherited together and form a haplotype block, which as a haploblock has a higher discrimination power than the individual SNPs within the block. Candidate haplotype blocks were selected from three major populations (Caucasian, East Asian, and African) using the following parameters: maximum match probability reduction = 0.85, linkage disequilibrium (LD) r² ≥ 0.7, maximum Fst = 0.06 (17), minimum number of SNPs = 3, minimum heterozygosity = 0.2, and minimum number of haplotypes = 3 (18).

4. Tag SNP (tagger SNP)

A linkage group may contain many SNP sites, but a few SNPs are enough to identify its haplotype pattern specifically. Such SNPs are called tag single nucleotide polymorphisms (tSNPs). They are representative, characteristic SNPs in the genome and form a set of genetic markers needed to construct haplotypes or to carry out association analysis. Genetic markers, such as a few SNPs, that suffice to identify most of the haplotypes within a haplotype block are called haplotype tag SNPs (htSNPs) (19).

4.1 Difference between tSNPs and htSNPs

The two terms, htSNPs and tSNPs, refer to two different strategies (8) for choosing the optimal minimum subset of SNPs from the entire set of SNPs. htSNPs are selected based on the haplotype-block model of LD pattern in a region of interest and represent the common haplotypes inferred from the original set of SNPs. On the other hand, tSNPs are selected based on measures of association, such that a tSNP predicts partially or completely the state of other SNPs.

4.2 Classification of methods for selecting tSNPs or htSNPs

Eight methods can be classified as haplotype-block-based methods (All common haplotypes, Haplotype diversity, R²h (coefficient of determination), and Entropy) and haplotype-block-free methods (TagIT (haplotype r²), LD r² (based on pairwise LD), PCA (principal component analysis), and BEST (based on set theory)). LD level is based on the following c
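
The haplotype-block-free idea in section 4.2, where tSNPs are chosen from pairwise LD r², can be illustrated with a short Python sketch. It is a minimal example under stated assumptions and not the implementation of any of the tools named above: haplotypes are assumed to be phased and coded 0/1, the 0.8 threshold is borrowed from the r² criterion listed in section 3.2.3, and the function names r_squared and greedy_tag_snps are hypothetical.

```python
from itertools import combinations


def r_squared(haps, i, j):
    """Pairwise r^2 between biallelic SNPs i and j from phased 0/1 haplotypes."""
    n = len(haps)
    p_a = sum(h[i] for h in haps) / n           # frequency of allele 1 at SNP i
    p_b = sum(h[j] for h in haps) / n           # frequency of allele 1 at SNP j
    p_ab = sum(h[i] & h[j] for h in haps) / n   # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                              # monomorphic SNP: no LD information
        return 0.0
    d = p_ab - p_a * p_b                        # LD coefficient D
    return d * d / denom


def greedy_tag_snps(haps, threshold=0.8):
    """Greedily pick tag SNPs so every SNP has r^2 >= threshold with some tag.

    Assumption: a simple greedy set cover, used here only for illustration.
    """
    n_snps = len(haps[0])
    # proxies[i]: SNPs (including i itself) that SNP i captures at the threshold
    proxies = {i: {i} for i in range(n_snps)}
    for i, j in combinations(range(n_snps), 2):
        if r_squared(haps, i, j) >= threshold:
            proxies[i].add(j)
            proxies[j].add(i)

    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Pick the untagged SNP that captures the most untagged SNPs;
        # ties go to the lowest index because candidates are sorted.
        best = max(sorted(untagged), key=lambda s: len(proxies[s] & untagged))
        tags.append(best)
        untagged -= proxies[best]
    return tags


# Toy data: SNPs 0 and 1 are perfectly correlated, as are SNPs 2 and 3.
example_haps = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
example_tags = greedy_tag_snps(example_haps)
```

For the toy haplotypes, one SNP from each perfectly correlated pair is enough, so example_tags is [0, 2]. A lower threshold would allow fewer tags at the cost of weaker prediction of the untyped SNPs.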
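
The pairwise quantities behind the LD-based block definitions in section 3 (D′ and r²), the four-gamete test mentioned for Wang et al. (7), and the "at least 95% of pairs with |D′| ≥ 0.70" rule attributed to Ding K et al. (8) can be sketched in Python as follows. This is a rough illustration under stated assumptions: haplotypes are phased and coded 0/1, point estimates of D′ are used instead of the confidence bounds of the Gabriel criteria, and the names ld_stats, four_gamete_test, and is_ld_block are hypothetical.

```python
from itertools import combinations


def ld_stats(haps, i, j):
    """Return (D, |D'|, r^2) for biallelic SNPs i and j from phased 0/1 haplotypes."""
    n = len(haps)
    p_a = sum(h[i] for h in haps) / n
    p_b = sum(h[j] for h in haps) / n
    p_ab = sum(1 for h in haps if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b
    # D' standardizes D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


def four_gamete_test(haps, i, j):
    """True if all four gametes (00, 01, 10, 11) are observed for SNPs i and j,
    which is taken as evidence of historical recombination between them."""
    return len({(h[i], h[j]) for h in haps}) == 4


def is_ld_block(haps, min_d_prime=0.70, min_fraction=0.95):
    """True if at least min_fraction of SNP pairs have |D'| >= min_d_prime.

    Assumption: point estimates of D' are used, with no confidence bounds.
    """
    pairs = list(combinations(range(len(haps[0])), 2))
    if not pairs:
        return False
    strong = sum(1 for i, j in pairs if ld_stats(haps, i, j)[1] >= min_d_prime)
    return strong / len(pairs) >= min_fraction


# Toy data: only three of the four possible gametes are present.
pair_haps = [(0, 0), (0, 0), (1, 1), (1, 0)]
d, d_prime, r2 = ld_stats(pair_haps, 0, 1)
```

With these four haplotypes only three gametes are observed, so the four-gamete test finds no evidence of recombination and |D′| equals 1, while r² is only 1/3. This is one reason the D′-based and r²-based criteria listed in section 3.2.3 can partition the same region differently.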