"microarrays-02-00171.pdf"

Microarrays 2013, 2, 171-185; doi:10.3390/microarrays2030171

OPEN ACCESS

microarrays

ISSN 2076-3905
www.mdpi.com/journal/microarrays

Review

Comparative Analysis of CNV Calling Algorithms: Literature
Survey and a Case Study Using Bovine High-Density SNP Data

Lingyang Xu 1,2, Yali Hou 3, Derek M. Bickhart 4, Jiuzhou Song 2 and George E. Liu 1,*

1 Bovine Functional Genomics Laboratory, BARC, BA, USDA-ARS, Beltsville, MD 20705, USA;
E-Mail: xulingyang2008@gmail.com

2 Department of Animal and Avian Sciences, University of Maryland, College Park, MD 20742, USA;
E-Mail: songj88@umd.edu

3 Laboratory of Disease Genomics and Individualized Medicine, Beijing Institute of Genomics,
Chinese Academy of Sciences, Beijing 100029, China; E-Mail: houyali1210@gmail.com

4 Animal Improvement Programs Laboratory, BARC, BA, USDA-ARS, Beltsville, MD 20705, USA;

E-Mail: Derek.Bickhart@ars.usda.gov

* Author to whom correspondence should be addressed; E-Mail: George.Liu@ars.usda.gov;

Tel.: +1-301-504-9843; Fax: +1-301-504-8414.

Received: 2 May 2013; in revised form: 4 June 2013 / Accepted: 5 June 2013 /
Published: 25 June 2013

Abstract: Copy number variations (CNVs) are gains and losses of genomic sequence
between two individuals of a species when compared to a reference genome. The data from
single nucleotide polymorphism (SNP) microarrays are now routinely used for genotyping,
but they also can be utilized for copy number detection. Substantial progress has been
made in array design and CNV calling algorithms and at least 10 comparison studies in
humans have been published to assess them. In this review, we first survey the literature on
existing microarray platforms and CNV calling algorithms. We then examine a number of
CNV calling tools to evaluate their impacts using bovine high-density SNP data. Large
incongruities in the results from different CNV calling tools highlight the need for
standardizing array data collection, quality assessment and experimental validation. Only
after careful experimental design and rigorous data filtering can the impacts of CNVs on
both normal phenotypic variability and disease susceptibility be fully revealed.

Keywords: copy number variation (CNV); algorithm; segmental duplication; single nucleotide
polymorphism (SNP); cattle genome

1. Introduction

Genomic structural variation, including copy number variation (CNV), has been extensively studied
in humans [1–5] and rodents [6–9]. Initial CNV reports have also been released for domesticated
animals, including dog [10–12], cattle [13,14], chicken [15,16], pig [17,18], sheep [19,20], and goat [21]
amongst others. Recent bovine CNV studies have generated several cattle CNV maps based on the
data from Illumina Bovine SNP50K microarrays [22–25].

CNVs can be identified using various approaches, including comparative genomic hybridization
(CGH) arrays, SNP arrays, and DNA sequencing. In spite of the increasing adoption of next-generation
sequencing, microarrays will continue to be the primary platform for CNV detection in the near future.
Compared to other approaches, the advantages of SNP arrays include their relatively low cost and high
throughput. Substantial genotyping data have been produced from genome-wide association studies,
which can be directly exploited for CNV analysis. Dozens of human and mouse CNV studies have
demonstrated that some CNVs are associated with phenotypic traits and diseases [26–29]. Efforts to
explore the association between cattle CNVs and economic traits have been published [30–32], even
though the actual functional mechanisms are not yet well defined.

2. CNV Detection Using SNP Arrays

SNP arrays were initially designed to genotype thousands of SNPs across the genome concurrently.
Their applications have now expanded to include CNV detection using additional information such as
the probe hybridization signal on each individual chip. The most well-known SNP microarrays are
available from commercial vendors such as Illumina and Affymetrix [33,34]. Both companies sell
competing arrays and continue to offer ever increasing coverage for detecting SNPs and CNVs
simultaneously. However, one important consideration is the inherent bias of the SNP chip coverage
against areas of the genome known to frequently harbor CNVs. For example, common copy number
polymorphisms (CNPs) may cause a SNP to be rejected when the SNP fails standard inheritance
checks and Hardy-Weinberg tests [35].

Segmental duplications (SDs), defined as >1 kb stretches of duplicated DNA with high sequence
identity in a species, were shown to be one of the catalysts and hotspots for CNV formation [36–38].
Although the current microarray platforms offer some detection power in SD regions, calls within
these regions are often affected by low probe density and cross-hybridization of repetitive sequence.
In addition, only a relative copy number (CN) increase or decrease is reported with respect to the
reference samples in SD regions. This poses a particular problem in the detection of CNVs in SD
regions as the test individual’s copy number may differ from that of the reference by a smaller
proportion than is detectable using array-based calling criteria. Although analyses of a subset of
CNVs provided evidence of linkage disequilibrium with flanking SNPs [39], a significant portion of
CNVs fell in genomic regions not well covered by SNP arrays, such as SD regions, and thus were not
genotyped [40–42].

Since SNP chips are primarily designed for SNP genotyping, some background noise
that does not affect SNP calling may cause problems for CNV calling algorithms. For example, SNP
data are typically normalized against a reference population in order to reduce between-array variations
and probe-specific hybridization effects. The assumption that the large majority of reference samples
have the same two copies does not hold for common CNV regions. At these regions, the normalization
should be further optimized to derive correct parameters. Several new array designs have incorporated
CNV detection; for example, monomorphic probes in common CNV regions are included on more
recent Illumina and Affymetrix SNP array platforms.

3. Algorithms for CNV Detection

Undoubtedly, microarray development has spurred the advances in computational analysis
methodology in quantitative fields of biology. A wide range of CNV discovery tools has been developed
based on data derived from SNP arrays, such as cnvPartition [43], Birdsuite [44], and PennCNV [45],
amongst others. In this section, we briefly introduce these CNV detection tools.

cnvPartition: Illumina data can be initially viewed, processed and exported using the proprietary
GenomeStudio program (Illumina, CA, USA). In addition to quality checking and genotype calling,
the program calculates several important input values for CNV discovery. The log R ratio (LRR), i.e.,
log2(Robserved/Rexpected), is calculated from the observed normalized intensity of a sample and expected
normalized intensity, which is calculated from linear interpolation of canonical genotype clusters. The
B allele frequency (BAF, normalized measure of relative signal intensity ratio of the B and A alleles)
is calculated from the difference between the actual value and the expected position of the cluster
group. LRR and BAF are used by many CNV detection algorithms. cnvPartition is offered as a plug-in
for the GenomeStudio program, where it uses LRR and BAF to assess copy number using 14 different
Gaussian distribution models between zero and four copies. cnvPartition also uses a likelihood-based
method to compute the confidence score for each CNV call. Given the integration of cnvPartition into
Illumina proprietary software (GenomeStudio), cnvPartition is currently unable to process and analyze
Affymetrix chip data.
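
The LRR definition above can be illustrated with a minimal sketch. It assumes the normalized intensities are already available as plain positive numbers; the function name `log_r_ratio` is hypothetical and is not part of GenomeStudio or cnvPartition. The inputs shown are idealized values for illustration only, since real arrays give noisy ratios.

```python
import math


def log_r_ratio(r_observed: float, r_expected: float) -> float:
    """LRR = log2(R_observed / R_expected) for a single SNP probe."""
    if r_observed <= 0 or r_expected <= 0:
        raise ValueError("normalized intensities must be positive")
    return math.log2(r_observed / r_expected)


# Idealized inputs (hypothetical values, not taken from any dataset):
lrr_as_expected = log_r_ratio(1.0, 1.0)   # observed equals expected intensity
lrr_half_signal = log_r_ratio(0.5, 1.0)   # observed is half the expected intensity
print(lrr_as_expected, lrr_half_signal)
```

A probe whose observed intensity matches the expectation gives an LRR of zero, and lower or higher intensities give negative or positive values, which is why LRR is a natural input for copy number calling.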

Birdsuite: Affymetrix SNP array data from older chips must first be analyzed in the Genotyping
Console program provided by Affymetrix for initial quality checks and controls. Data from the newer
Affymetrix chip can be processed by additional programs contained in the Birdsuite package [44].
The Canary module of Birdsuite genotypes the known common CNVs using an Expectation-Maximization
(EM) algorithm while the Birdseye module detects novel CNVs by using a Hidden Markov Model
(HMM) with a Viterbi algorithm calculating emission states. For Affymetrix SNP arrays, there are
other freely available CNV detection programs, such as GADA [46], Cokgen [47], and iPattern [26], in
addition to Birdsuite. For details about these programs, please see these published reviews [35,48,49].
The developers of Birdsuite have mentioned future plans for Illumina platform support [50] but current
options only include a beta version for Illumina 610 array platforms.

PennCNV and QuantiSNP: PennCNV and QuantiSNP are two freely available programs developed
based on HMMs [45,51]. Both programs can process Illumina and Affymetrix SNP data. PennCNV
incorporates multiple sources of information, including LRR and BAF at each SNP marker, the distance
between neighboring SNPs and the allele frequency of SNPs. PennCNV also integrates a computational
approach by fitting regression models with GC content to overcome "genomic waves" [52,53].
Additionally, PennCNV is capable of considering pedigree information (a parents-offspring trio)
to improve call rates and accuracy of breakpoint prediction as well as to infer chromosome-specific
SNP genotypes in CNVs. Finally, PennCNV also reports data quality control measurements for each
CNV dataset.

QuantiSNP, by contrast, uses an Objective Bayes approach [51] to infer copy number states based
on the log R ratio and the B allele frequency for each SNP marker. Whereas the PennCNV algorithm
uses a transition matrix to model realistic copy number transitions between SNP probes [45],
QuantiSNP calculates Bayesian probabilities for each SNP marker pair and then uses an HMM to join
markers to form CNVs. Another significant difference between the two programs is that PennCNV is
an open-source project whereas QuantiSNP was written for MATLAB, which may limit availability to
users who do not have a MATLAB license. Finally, QuantiSNP is no longer under active development,
as listed on its webpage [54].
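
Several of the tools above are described as HMM-based. The following is a minimal sketch of HMM decoding over LRR values with the Viterbi algorithm, using three copy number states and Gaussian emissions. It is not the PennCNV, QuantiSNP or Birdseye model: the state set, emission means, standard deviation and transition probability are all hypothetical values chosen for illustration, and the real tools also use BAF and other information.

```python
import math

STATES = ("loss", "normal", "gain")
# Hypothetical emission parameters: mean LRR per state and a shared SD.
MEANS = {"loss": -0.6, "normal": 0.0, "gain": 0.4}
SD = 0.2
STAY = 0.98  # hypothetical probability of staying in the same state


def log_gauss(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_trans(prev, cur):
    """Log transition probability; leaving a state is split evenly."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(lrr):
    """Return the most likely state path for a list of LRR values."""
    if not lrr:
        return []
    # Uniform start probabilities over the states.
    score = {s: math.log(1 / len(STATES)) + log_gauss(lrr[0], MEANS[s], SD)
             for s in STATES}
    back = []
    for x in lrr[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans(p, cur))
            new_score[cur] = (score[best_prev] + log_trans(best_prev, cur)
                              + log_gauss(x, MEANS[cur], SD))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi([0.02, -0.05, 0.01, -0.62, -0.7, -0.55, -0.6, 0.03, 0.0]))
```

With these assumed parameters, a run of several consecutive low-LRR markers is decoded as a loss, whereas a single outlying marker is not, because the transition penalty outweighs the emission gain of one marker.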

Approaches originally developed for array CGH: Several tools for CNV detection, which were
originally developed for array CGH CNV calling, have been modified for SNP array analysis. However,
these methods normally do not consider BAF information, which is the preferred data source to use for
CNV calling in SNP data. For example, the Circular Binary Segmentation (CBS) method was designed
to convert noisy intensity values into neighboring segments of distinct assigned copy numbers using
dynamic programming [55]. DNAcopy is a widely used R implementation of the CBS method.
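
The general idea of turning noisy intensity values into segments can be illustrated with a deliberately simplified sketch. It scans every interval of markers and reports the one whose mean LRR differs most from the remaining markers. This is not the CBS algorithm as implemented in DNAcopy: there is no permutation testing and no recursion, and the noise standard deviation `noise_sd` is assumed to be known. The function name and the example values are hypothetical.

```python
from itertools import accumulate


def best_interval(lrr, noise_sd):
    """Return (z, i, j) for the interval lrr[i:j] whose mean differs most from the rest."""
    n = len(lrr)
    prefix = [0.0] + list(accumulate(lrr))  # prefix[k] = sum of lrr[:k]
    total = prefix[-1]
    best = (0.0, None, None)
    for i in range(n):
        for j in range(i + 1, n + 1):
            inside = j - i
            outside = n - inside
            if outside == 0:
                continue  # the whole array has nothing to compare against
            sum_in = prefix[j] - prefix[i]
            mean_in = sum_in / inside
            mean_out = (total - sum_in) / outside
            # Two-sample z-like statistic with an assumed known noise SD.
            z = (mean_in - mean_out) / (noise_sd * (1 / inside + 1 / outside) ** 0.5)
            if abs(z) > abs(best[0]):
                best = (z, i, j)
    return best


values = [0.01, -0.02, 0.03, 0.0, -0.01, 0.02,
          -0.6, -0.62, -0.58, -0.61,
          0.0, 0.02, -0.01, 0.01]
z, start, end = best_interval(values, noise_sd=0.05)
print(start, end, round(z, 1))
```

In this toy input the four low values occupy positions 6 to 9, and the scan returns that interval with a negative statistic, consistent with a copy number loss.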

Other commercial CNV detection tools: Other commercially available programs include Partek
Genomics Suite, Nexus Copy Number software and Golden Helix SNP & Variation Suite (SVS).
The strengths of these commercial tools include their graphical user interfaces, streamlined pipelines for
analysis and work flow, optimized computational speed as well as technical support. These factors are
very important to labs with limited bioinformatics support; however, commercial companies often do
not utilize some of the latest methods developed in the academic environment. For this study, we have
chosen to look in detail at the Golden Helix SVS [56]. The SVS Copy Number Analysis Module
(CNAM) employs a segmentation algorithm using only the signal intensity data to detect CNVs on
either a per-sample (univariate) or multi-sample (multivariate) basis. According to its online manual,
the univariate method, which considers only one sample at a time, is designed for detecting rare and/or
large CNVs. The multivariate method, which considers all samples simultaneously, is designed for
detecting small, common CNVs.

Comparing univariate and multivariate methods: Although the exact algorithm of each method
is proprietary, Breheny et al. explored the strengths and weaknesses of two similar approaches using
both simulations and real data [57]. In their study, the univariate method (the CNV-level testing, i.e.,
across markers within one sample) involves estimating, at the level of the individual genome, the
underlying copy number at each location. Once this is completed, tests are performed to determine the
association between copy number state and phenotype. The multivariate method (the pooled
marker-level testing across samples) carries out association testing first between the phenotypes and
raw intensities at the level of the individual marker, and then aggregates neighboring test results to
identify CNVs associated with the phenotype. Accounting for multiple comparisons across SNP
markers is more straightforward, as a multiple-comparison correction (e.g., Bonferroni, permutation)
can directly control the family-wise error rate (FWER) of the overall procedure [58]. False discovery
rates can be calculated to account for multiple comparisons with the CNV-level testing method [59];
however, this is more complicated and somewhat conservative. Partially overlapping CNVs across cell
lines introduce dependence across the tests, thereby reducing the effective number of independent
tests. Breheny et al. confirmed that the univariate method/CNV-level testing has greater power to
detect associations involving large, rare CNVs, while the multivariate method/pooled marker-level
testing has greater power to detect associations involving small, common CNVs. It is important to
understand these tradeoffs. Several recent papers have proposed to develop methods capable of
simultaneously pooling information across both markers and samples for CNV detection and
association studies [60–64].
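
The marker-level approach described above, in which each marker is tested and neighboring results are then aggregated, can be sketched as follows. This is a minimal illustration assuming per-marker p-values have already been computed; the function names are hypothetical, and grouping strictly consecutive significant markers is only one possible aggregation rule, not the procedure of Breheny et al.

```python
def bonferroni_hits(p_values, alpha=0.05):
    """Indices of markers significant after Bonferroni correction (FWER control)."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [i for i, p in enumerate(p_values) if p <= threshold]


def group_adjacent(indices):
    """Group sorted marker indices into (first, last) runs of consecutive markers."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


# Hypothetical per-marker p-values along one chromosome.
p_values = [0.2, 1e-6, 2e-5, 0.04, 0.5, 1e-4]
hits = bonferroni_hits(p_values)
print(hits, group_adjacent(hits))
```

With six markers the Bonferroni threshold is 0.05/6, so the marker with p = 0.04 is not retained even though it would pass an uncorrected 0.05 cutoff.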

CNV quality score: Many programs like cnvPartition, Birdsuite, PennCNV and QuantiSNP
report CNV quality scores, which are quantitative values indicating CNV confidence. Although
their exact meanings and interpretations depend on each algorithm and they are often not reported in
microarray studies, these CNV quality scores are important for constructing CNV regions, which can
then be used in association studies.

4. Comparing the CNV Detection Algorithms Using Human Data

As shown in Table 1, at least 10 comparisons of the strengths and weaknesses of these array
platforms and CNV calling tools have been published using human CNV data. Although published
results are quickly outdated as new platforms and tools are introduced, two general themes are consistent
across these comparisons. The first of these is the lack of a standard approach to collecting the data and
the lack of standardized reference samples; this makes it difficult to compare CNV results across
different studies [65]. The second is that CNV results also differ substantially depending on CNV
detection methods [35,49]. For example, as the most comprehensive study on this topic, Pinto et al.
have systematically compared CNV detection on 11 microarray platforms to evaluate data quality and
CNV calling, reproducibility, concordance across array platforms and laboratories, breakpoint
accuracy and analysis tool variability [49]. It is surprising that reproducibility in replicate experiments
is <70% for most platforms and different analytic tools applied to the same raw data typically yield
CNV calls with <50% concordance. The authors attributed this poor reproducibility to two
facts: (1) large CNVs often overlap with SDs in complex genomic regions (as we described before) and
(2) large CNVs also lead to call fragmentation (a single CNV is detected as multiple smaller variants).
This led the authors to conclude that, "the striking differences between CNV calls from different
platforms and analytic tools highlight the importance of careful assessment of experimental design in
discovery and association studies and of strict data curation and filtering in diagnostics" [49].

Table 1. Survey of recent comparison studies of copy number variation (CNV) detection.

| Authors | Year | Algorithm | Data | Platform | Vendor | Conclusion | Comment |
|---|---|---|---|---|---|---|---|
| Lai [66] | 2005 | CGHseg, Quantreg, CLAC, GLAD, CBS, HMM, Wavelet, Lowess, ChARM, GA and ACE | Simulation and empirical samples for Glioblastoma | array CGH | Custom cDNA array | Several general characteristics of future program development were suggested. | Earlier programs for array CGH. |
| Baross [67] | 2007 | CNAG, dChip, CNAT, GLAD | Simulation and empirical mental retardation 100K Affymetrix SNP array | SNP array | Affymetrix | Multiple programs were needed to find all real aberrations. | False positive deletions were substantial, but could be greatly reduced by using the SNP genotype information to confirm loss of heterozygosity. |
| Winchester [35] | 2009 | Birdsuite, CNAT, GADA, PennCNV, QuantiSNP | NA12156, NA15510 | SNP array | Affymetrix, Illumina | Multiple predictions from different software. | Use software designed for the platform. |
| Dellinger [68] | 2010 | CBS, cnvFinder, cnvPartition, GLAD, Nexus, PennCNV and QuantiSNP | Simulation and empirical samples from Singapore cohort study of the risk factors for Myopia | SNP array | Illumina | QuantiSNP outperformed other methods based on ROC curve residuals over most datasets. Nexus Rank and SNPRank have low specificity and high power. Nexus Rank calls oversized CNVs. PennCNV detects one of the fewest numbers of CNVs. | The normalized singleton ratio (NSR) is proposed as a metric for parameter optimization. |
| Tsuang [69] | 2010 | PennCNV, QuantiSNP, HMMSeg, and cnvPartition | 48 Schizophrenia samples | SNP array | Illumina | Given the variety of methods used, there will be many false positives and false negatives. | Both guidelines for the identification of CNVs inferred from high-density arrays and the establishment of a gold standard for validation of CNVs are needed. |
| Zhang [70] | 2011 | Birdsuite, Partek Genomics Suite, HelixTree, and PennCNV-affy | ~1,000 Bipolar + 270 HapMap samples | SNP array | Affymetrix | Birdsuite and Partek had higher positive predictive values. | Poor overlap between 2 gold standards (Kidd et al. and Conrad et al.). |
| Marenne [71] | 2011 | cnvPartition, PennCNV, and QuantiSNP | 96 pair samples from Spanish Bladder Cancer/EPICURO study | SNP array | Illumina | PennCNV was the most reliable algorithm when assessing the number of copies. | Current calling algorithms should be improved for high performance CNV analysis in genome-wide scans. |
| Pinto [49] | 2011 | Birdsuite, cnvFinder, cnvPartition, dCHIP, ADM-2 (DNA Analytics), Genotyping Console (GTC), iPattern, Nexus Copy Number, Partek Genomics Suite, PennCNV, QuantiSNP | 6 samples in triplicate on 11 array platforms | array CGH, SNP array, and BAC array | Agilent, NimbleGen, Affymetrix, and Illumina | Different analytic tools applied to the same raw data typically yield CNV calls with <50% concordance. Moreover, reproducibility in replicate experiments is <70% for most platforms. | The CNV resource presented here allows independent data evaluation and provides a means to benchmark new algorithms. CNV calls are disproportionally affected by genome complexity as they tend to overlap SDs and a single CNV is detected as multiple smaller variants. |
| Koike [48] | 2011 | Birdsuite, Birdseye, PennCNV, CGHseg, DNAcopy | HapMap samples | SNP array | Affymetrix | Hidden Markov model-based programs PennCNV and Birdseye (part of Birdsuite), or Birdsuite show better detection performance. | Segmental duplications and interspersed repeats (LINEs) are involved in CNVs. |
| Eckel-Passow [72] | 2011 | Affymetrix Power Tools (APT), Aroma.Affymetrix, PennCNV and CRLMM | 1,418 GENOA (Genetic Epidemiology Network of Atherosclerosis)/FBPP (Family Blood Pressure Program) samples | SNP array | Affymetrix | Recommended trying multiple algorithms, evaluating concordance/discordance and subsequently consider the union of regions for downstream association tests. | Advocated that software developers need to provide guidance with respect to evaluating and choosing optimal settings in order to obtain optimal results for an individual dataset. |

5. Comparing CNV Detection Algorithms Using Bovine High-Density SNP Data

We performed an analysis of CNVs based on Illumina BovineHD chips, which contain more than
750,000 SNP markers [73], using PennCNV. As a consequence of the higher SNP count, more CNVs
were identified with higher resolution boundaries. In order to provide an additional comparison of CNV
detection methods, we have tested three additional tools to call CNVs on the same BovineHD dataset:
cnvPartition version 3.6.1, Golden Helix SVS 7.0 and DNAcopy [55]. These four tools were applicable
to our dataset (Illumina bead array), available to us (due to existing commercial licensing or free
availability) and were not designed specifically for human-based array studies. In order to perform an
accurate and fair comparison of calls across the different methods, our PennCNV calls were derived
from the same 630 animals of 27 cattle breeds on the cattle reference assembly UMD3.1 without using
trio information [73].
We carried out cnvPartition calling using the default parameters as recommended by Illumina. For
Golden Helix SVS 7.0, we used the SVS DSF Export Plug-In 4.1 to export LRRs from the
GenomeStudio project. We then utilized CNAM to process the DSF file under the univariate option
(minimum 3 markers/segment, a significance level of p = 0.005 for 2,000 pairwise permutations). We
also performed DNAcopy analysis based on LRR. Finally, CNV segments were filtered with a
minimum of 3 probes for all 4 tools and a minimum absolute segment mean value of 0.3 for SVS and
DNAcopy.

Table 2. CNVs and CNVRs identified using PennCNV, cnvPartition, SVS, and DNAcopy.

| Tool | Event | Count | Gain | Loss | Average Length |
|---|---|---|---|---|---|
| PennCNV | CNV | 46,751 (74.2) | 17,796 (28.2) | 28,955 (46.0) | 2,334,244,479 (49,929) |
| PennCNV | CNVR | 3,364 a | 1,382 b | 2,376 c | 147,476,461 (43,840) |
| cnvPartition | CNV | 16,566 (26.3) | 5,021 (8.0) | 11,545 (18.3) | 2,191,528,246 (132,291) |
| cnvPartition | CNVR | 1,298 a | 541 b | 916 c | 172,378,730 (132,803) |
| SVS | CNV | 92,463 (146.8) | 205 (0.3) | 92,258 (146.4) | 2,234,601,290 (24,168) |
| SVS | CNVR | 7,099 a | 78 b | 7,056 c | 151,471,634 (21,337) |
| DNAcopy | CNV | 41,858 (66.4) | 4,469 (7.1) | 37,389 (59.3) | 1,863,930,368 (44,530) |
| DNAcopy | CNVR | 5,961 a | 1,457 b | 5,284 c | 194,287,154 (32,593) |

Numbers in parentheses are values normalized by sample counts, except in the case of the parentheses
values in the "Average Length" column, which are average lengths normalized by CNV counts.
a These numbers represent non-redundant CNVR counts after merging both gain and loss CNVs
identified across all 630 samples. b Gain CNV events were merged separately. c Loss CNV events were
merged separately.

A summary of CNV and CNVR results derived from all 630 samples is shown in Table 2. Detailed
results can be found in the four worksheets of Supplementary Table 1. Compared to PennCNV results,
CNVs and CNVRs in cnvPartition results are fewer and ~3 times longer (45 kb vs. 130 kb, respectively).
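
The filtering criteria stated in this section (at least 3 probes per segment, and an absolute segment mean of at least 0.3 for SVS and DNAcopy) and the merging of CNVs into non-redundant CNVRs mentioned in the Table 2 footnotes can be sketched as follows. This is an illustration only: the `Segment` record and function names are hypothetical, the example calls are invented, and merging any overlapping calls on the same chromosome is an assumed rule, since the exact merging procedure is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    chrom: str
    start: int
    end: int
    n_probes: int
    mean_lrr: float


def passes_filter(seg, min_probes=3, min_abs_mean=None):
    """Probe-count filter for all tools; mean filter only when min_abs_mean is given."""
    if seg.n_probes < min_probes:
        return False
    if min_abs_mean is not None and abs(seg.mean_lrr) < min_abs_mean:
        return False
    return True


def merge_into_regions(segments):
    """Merge overlapping segments per chromosome (assumed rule: any overlap)."""
    regions = []
    for seg in sorted(segments, key=lambda s: (s.chrom, s.start, s.end)):
        if regions and regions[-1][0] == seg.chrom and seg.start <= regions[-1][2]:
            chrom, start, end = regions[-1]
            regions[-1] = (chrom, start, max(end, seg.end))
        else:
            regions.append((seg.chrom, seg.start, seg.end))
    return regions


# Invented calls pooled across samples, for illustration only.
calls = [
    Segment("chr1", 100, 500, 5, -0.45),
    Segment("chr1", 400, 900, 4, -0.35),
    Segment("chr1", 2000, 2500, 2, -0.80),  # too few probes
    Segment("chr1", 3000, 3500, 6, 0.10),   # segment mean too close to zero
    Segment("chr2", 100, 500, 3, 0.40),
]
kept = [s for s in calls if passes_filter(s, min_probes=3, min_abs_mean=0.3)]
print(merge_into_regions(kept))
```

For tools such as PennCNV and cnvPartition, where only the probe-count criterion is stated, `min_abs_mean` would be left as `None` in this sketch.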
While PennCNV and cnvPartition have loss/gain ratios of ~1.7 and DNAcopy has a ratio of 3.6, SVS has a ratio over 90, suggesting SVS is more sensitive to loss events than to gain events. Additionally, both SVS and DNAcopy CNVRs (average length approximately 20 kb and 30 kb, respectively) are shorter than PennCNV (~40 kb), and significantly shorter than cnvPartition CNVRs (~132 kb). Similar observations were also obtained when each subspecies/group (i.e. taurine, indicine, composite (taurine × indicine) and African breeds) was processed separately, confirming the above
