小枫炮炮's blog http://blog.sciencenet.cn/u/maplesword Breathing the air of freedom, swimming freely in the sea of computational biology

[Paper Excerpt]Genome-wide analysis of allelic expression imbalance in human pri

4946 views | 2010-8-28 23:58 | Personal category: Uncategorized | System category: Paper discussion | Keywords: scholar | ngs, ASE

Highlight: This is an easy-to-understand paper that describes in detail a process for analyzing allele-specific expression (ASE) with RNA-Seq data. It is very useful, and we could follow its steps to obtain preliminary results.

Outline:
I. Expectation: a large fraction of the loci (SNPs) is expected to play a regulatory role in gene expression, via effects on transcription, message stability and splicing.

II. Data
1. Samples: eight independently sequenced human poly(A)-selected transcriptomes obtained from primary cells from four healthy donors using high-throughput paired-end (PE) resequencing. Each of the four individuals was sequenced under two conditions: stimulated (T-cell activation) and unstimulated.
2. Sequencing: Illumina GAII, 45 bp reads, most of which are paired end.
3. Mapping: Ensembl v52 CCDS was used as the reference sequence set, with one additional sequence per intron that extends the intron boundaries by 40 bp on each side to allow mapping of reads overlapping exon-intron junctions. One sequence per non-standard exon-exon junction (up to three skipped exons) was also included. Reads were mapped using novoalign (www.novocraft.com) V2.05.12 in PE mode for paired reads and in SE mode for SE data. Quality reads were defined as uniquely mapped reads with a phred-scaled probability score >=20.
4. Importance of transcript coverage: the ability to detect an allelic imbalance using ASE depends on two parameters: the strength of the allelic imbalance and the read depth at the reporter heterozygous SNP. The authors analytically computed the read depth required to demonstrate allelic imbalance for different allelic ratios. Based on the results, they tested for ASE only at SNPs with read depth >50, which removes SNPs that provide almost no power to detect allelic imbalance.
5. SNP calling: a SNP was called heterozygous when the read depth was >=50 and the frequency of the second most common allele was at least 15%.
6. Quality filtering for heterozygous SNPs: to verify that the allelic call is independent of the position of the SNP within the 45-base reads, the position distributions of the two alleles were compared using the Kolmogorov-Smirnov test; to check for strand-specific (forward/reverse) biases, a goodness-of-fit chisq test on the 2 by 2 table of allelic calls by strand was used.
7. Quantity: the authors first tested for ASE 589673 dbSNPs (from Ensembl r52) located in annotated spliced transcripts, and also tested 4282208 intronic dbSNPs. Heterozygous SNPs with sufficient read depth were selected, leaving 4929 heterozygous SNP/sample pairs that could be tested for allelic imbalance. Grouping these SNPs by transcript for each sample gave 3107 transcript/sample pairs with sufficient coverage for ASE analysis, covering 1371 transcripts and 2701 SNPs.
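
The read-depth argument in item 4 can be illustrated with a textbook sample-size formula for testing a proportion against 0.5. This is only a sketch and not the authors' own computation, which the excerpt does not reproduce: the function name `required_depth`, the normal approximation, the significance level (0.001) and the target power (0.8) are all assumptions chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_depth(allelic_ratio, alpha=0.001, power=0.8):
    """Approximate read depth needed to detect a given allelic ratio.

    allelic_ratio: true fraction of reads carrying one allele (0.5 = balanced).
    alpha, power:  illustrative assumptions, not values taken from the paper.
    Uses the normal approximation to the binomial (two-sided test vs. 0.5).
    """
    if not 0.0 < allelic_ratio < 1.0 or allelic_ratio == 0.5:
        raise ValueError("allelic_ratio must be in (0, 1) and differ from 0.5")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)
    sd_null = 0.5                                    # sqrt(0.5 * 0.5)
    sd_alt = sqrt(allelic_ratio * (1.0 - allelic_ratio))
    effect = abs(allelic_ratio - 0.5)
    depth = ((z_alpha * sd_null + z_power * sd_alt) / effect) ** 2
    return ceil(depth)


if __name__ == "__main__":
    for ratio in (0.6, 0.7, 0.8, 0.9):
        print(ratio, required_depth(ratio))
```

Under these assumptions the required depth rises steeply as the allelic ratio approaches 0.5, which is why SNPs with low coverage contribute almost no power.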

III. ASE test
1. Method: testing a single SNP for allelic imbalance uses a chisq goodness-of-fit test for even frequencies of both alleles.
    * When considering several SNPs located in the same gene at once: novoPhase (http://www.gene.cimr.cam.ac.uk/todd/) was first used for phasing. With the phase known, the counts can be summed across heterozygous SNPs, counting reads that overlap multiple SNPs only once.
2. Results: a total of 4929 heterozygous SNP/sample pairs were tested, and 370 SNPs showed evidence of allelic imbalance at P < 0.001, with FDR ~ 1%.
3. HapMap SNPs: 87796 dbSNPs in spliced transcripts passed HapMap quality filters. When the analysis was restricted to this subset of SNPs, the ASE rate was significantly lower (4.6%), suggesting a more reliable estimate.
4. Coverage: when restricting this analysis to HapMap SNPs with read depth >100, a higher ASE rate of 7.5% (66 of 878) was achieved.
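
The heterozygous call from the Data section and the single-SNP test in item 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the input layout (a dict of per-allele read counts) are assumptions. The thresholds (depth >=50, second allele >=15%, P < 0.001) are the ones quoted in this summary. With one degree of freedom, the chi-square p-value can be computed with `math.erfc`, so no external library is needed.

```python
from math import erfc, sqrt


def is_heterozygous(allele_counts, min_depth=50, min_second_freq=0.15):
    """Call a site heterozygous from quality-read counts per allele.

    allele_counts: dict such as {"A": 70, "G": 30} (hypothetical input layout).
    """
    depth = sum(allele_counts.values())
    if depth < min_depth:
        return False
    ranked = sorted(allele_counts.values(), reverse=True)
    second = ranked[1] if len(ranked) > 1 else 0
    return second / depth >= min_second_freq


def ase_chisq(count_a, count_b):
    """Chi-square goodness-of-fit test against a 50:50 allelic ratio (1 df).

    Returns (statistic, p_value).
    """
    n = count_a + count_b
    if n == 0:
        raise ValueError("no reads at this SNP")
    # (a - n/2)^2 / (n/2) + (b - n/2)^2 / (n/2) simplifies to (a - b)^2 / n
    statistic = (count_a - count_b) ** 2 / n
    # Survival function of the chi-square distribution with 1 df
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    counts = {"A": 80, "G": 20}
    if is_heterozygous(counts):
        stat, p = ase_chisq(counts["A"], counts["G"])
        print(f"chisq = {stat:.1f}, imbalanced at P < 0.001: {p < 0.001}")
```

For 80 versus 20 reads the statistic is 36, far beyond the P < 0.001 threshold, whereas 60 versus 40 reads (statistic 4) would not be called imbalanced at that threshold.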

III. Validation of the ASE analysis
1. Genotyping data: the authors genotyped the four individuals using the Illumina Quad660W BeadChip. They lowered the minimum read depth required to call SNPs in RNA-Seq data to 20, and identified 9727 SNP/sample pairs shared between RNA-seq and the genotyping chip. Of these 9727 pairs, 6885 calls are homozygous on both the genotyping chip and RNA-Seq, and only 1 call is homozygous on the chip but heterozygous in RNA-seq. Of the remaining 2841 pairs with heterozygous calls on the chip, 14 are homozygous in the RNA-seq data. Four of these calls are located in the transcript SNRPN, a known parentally imprinted gene (monoallelic expression). Three are located in ERAP2, which is known to be under complete cis-acting differential allelic control. Four had too low a read depth. The other three calls may be real.
2. Single locus validation: to confirm that the findings are not the consequence of technical biases, the authors selected four HapMap SNP/individual pairs and validated them using two different locus-specific assays: clone-based allele-specific expression (C-BASE) and PeakPicker. All four initial RNA-seq results replicated, and PeakPicker and C-BASE gave consistent results. For C-BASE, an allelic bias was observed even with genomic DNA (a technical bias), so the gDNA allelic distribution was used as the control in the chisq test.
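
The genotype comparison in item 1 is easy to check by arithmetic. The sketch below uses only the counts quoted above; the variable names are illustrative, and treating the chip genotype as the reference is an assumption made for the purpose of computing rates.

```python
# Counts quoted in the summary above (SNP/sample pairs)
total_pairs = 9727
hom_both = 6885               # homozygous on chip and in RNA-seq
hom_chip_het_rnaseq = 1       # homozygous on chip, heterozygous in RNA-seq
het_chip = 2841               # heterozygous on chip
het_chip_hom_rnaseq = 14      # heterozygous on chip, homozygous in RNA-seq

# The three groups partition the shared pairs
assert hom_both + hom_chip_het_rnaseq + het_chip == total_pairs

# Breakdown of the 14 discordant heterozygous calls
discordant_breakdown = {
    "SNRPN (imprinted)": 4,
    "ERAP2 (cis-acting allelic control)": 3,
    "read depth too low": 4,
    "possibly real": 3,
}
assert sum(discordant_breakdown.values()) == het_chip_hom_rnaseq

# Assumption: the chip genotype is treated as the reference call
het_concordant = het_chip - het_chip_hom_rnaseq
missed_het_rate = het_chip_hom_rnaseq / het_chip
overall_concordance = (hom_both + het_concordant) / total_pairs

print(f"concordant heterozygous calls: {het_concordant}")
print(f"chip-heterozygous calls seen as homozygous: {missed_het_rate:.2%}")
print(f"overall concordance: {overall_concordance:.2%}")
```

So about 0.5% of chip-heterozygous calls appear homozygous in RNA-seq, and at least seven of those 14 reflect known biology (SNRPN, ERAP2) rather than genotyping error.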

IV. ASE analysis of disease-associated genes
1. Genes: the authors reviewed the literature and identified 79 genes previously associated with autoimmune disorders.
2. SNPs: all heterozygous SNPs were included, even those not listed in the Ensembl database but discovered using the method described above. 61 heterozygous SNPs with read depth >50 were found, and 8 of them were not listed in dbSNP. These SNPs were located in 22 genes.
3. Counting each SNP/individual pair separately: 127 pairs were tested and 13 were imbalanced.
4. Analysis of these 13 pairs: the authors first grouped them by the gene in which they were located, then did phasing to test whether the SNPs within a gene were consistent with each other.
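
The consistency check in item 4 can be pictured as follows. This is a hypothetical sketch: it assumes the phase is already known (for example from novoPhase, which is not reproduced here), and the function name, input layout and example counts are made up for illustration.

```python
def consistent_direction(phased_counts):
    """Check that all SNPs of one gene favour the same haplotype.

    phased_counts: list of (haplotype1_reads, haplotype2_reads), one tuple per
    heterozygous SNP, with alleles already assigned to haplotypes by phasing.
    SNPs with equal counts carry no direction and are ignored.
    """
    directions = {h1 > h2 for h1, h2 in phased_counts if h1 != h2}
    return len(directions) <= 1


if __name__ == "__main__":
    # Hypothetical counts for two genes
    gene_a = [(80, 20), (70, 35)]   # both SNPs favour haplotype 1
    gene_b = [(80, 20), (30, 70)]   # the SNPs disagree
    print(consistent_direction(gene_a), consistent_direction(gene_b))
```

Several imbalanced SNPs pointing to the same haplotype give stronger evidence of ASE than a single SNP, while SNPs that disagree after phasing suggest a technical artifact at one of them.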

V. Possible biases
1. Sequencing chemistry biases: PCR can over-amplify identical cDNA fragments. Because an identical mapping location for both 45 bp reads of a pair is unlikely to occur by chance, the authors kept only a single read pair from each set of clonal read pairs. For biases specific to the forward/reverse read direction, the authors verified that, for heterozygous SNPs, the allelic ratio distribution was consistent between read counts from the forward and reverse strands (so read direction makes no difference).
2. In silico mapping biases: mapping is biased towards the reference allele. The authors addressed this in two ways: they replaced the reference allele with the genetic ambiguity code that encodes both possible alleles at the SNP, and they relaxed the threshold on the number of mismatches allowed for a read to be mapped, to limit the impact of a small number of mismatched SNPs. These two changes corrected most of the reference allele bias.
3. Indels: the authors excluded from the ASE analysis heterozygous SNPs within 45 bp of an indel called in the same individual (novoalign provides calls for small indels).
4. Other biases: 1) Low-complexity or repeat sequence surrounding heterozygous SNPs is associated with an elevated ASE positive rate. The authors defined a sequence as low-complexity/repeat when more than 25% of the 90 bp surrounding the SNP (45 bp on each side) was masked by RepeatMasker. 2) The ASE positive rate was lower for heterozygous SNPs that passed HapMap quality filters (mentioned before).
5. Validation of additional ASE results: using individual 1 data, the authors selected 22 transcripts with a unique imbalanced heterozygous SNP (resequencing, PeakPicker). The data suggest that several biases, some of which are unknown, increase the FDR. The false positive rate among non-HapMap SNPs is ~50%. Thus "unproven quality" SNPs should only be used for ASE estimation with great caution, and the presence of multiple imbalanced SNPs is required to provide convincing evidence of ASE.
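
The first correction in item 2, replacing the reference allele with an ambiguity code, can be sketched with the standard IUPAC two-base codes. The function name `mask_snps` and the input layout (0-based positions mapped to allele pairs) are assumptions for illustration; how the authors actually prepared their reference is not described in this summary.

```python
# Standard IUPAC ambiguity codes for two-base combinations
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def mask_snps(reference, snps):
    """Replace the reference base at each SNP by its IUPAC ambiguity code.

    reference: reference sequence as a string.
    snps: dict mapping 0-based position -> (allele1, allele2); a hypothetical
          layout chosen for this sketch.
    """
    seq = list(reference)
    for pos, (allele1, allele2) in snps.items():
        alleles = frozenset((allele1.upper(), allele2.upper()))
        if alleles not in IUPAC_TWO_BASE:
            raise ValueError(f"not a biallelic SNP at position {pos}")
        if seq[pos].upper() not in alleles:
            raise ValueError(f"reference base at {pos} is not one of the alleles")
        seq[pos] = IUPAC_TWO_BASE[alleles]
    return "".join(seq)


if __name__ == "__main__":
    print(mask_snps("ACGTAC", {2: ("G", "A")}))  # the G/A site becomes R
```

With the ambiguity code in place, a read carrying either allele can be scored the same against the reference at that position, which removes the built-in advantage of the reference allele.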

VI. Other tests
1. PE vs. SE reads: PE is better because of higher coverage.
2. Sequence capture: to improve the read depth.

https://m.sciencenet.cn/blog-457844-357179.html
