Statistical approaches for distinguishing somatic copy number alterations (CNAs) from germline copy number variants (CNVs) in SNP genotyping array data are underdeveloped. We address this issue by extending the generative probabilistic model of an HMM method, CNA-HMMer [Shah et al., 2006], to detect and discriminate between patient-specific CNVs and somatic CNAs. The model segments the log ratio intensity data and, for each segment, predicts a discrete copy number status from a set of 5 somatic states (homozygous deletion, hemizygous deletion, gain, amplification, and high-level amplification), 5 analogous germline states, and neutral copy number.
HMM-Dosage can predict germline events even in the absence of normal samples by probabilistically incorporating prior CNV information derived from any source. This additional prior enables the HMM to capture the distributional properties of CNVs when fitting its model parameters, so that a dichotomous (somatic versus germline) output is obtained within a single, unified probabilistic framework.
Please feel free to contact Gavin Ha (gha [at] bccrc [dot] ca) if you have any questions regarding this software.
Ha, G. & Shah, S. P. Distinguishing Somatic and Germline Copy Number Events in Cancer Patient DNA Hybridized to Whole-Genome SNP Genotyping Arrays, chap. 22. Methods in Molecular Biology (Springer Science and Business Media, LLC, 2013). * Use this publication for citing the hmmDosage software
C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. Speed, A. G. Lynch, S. Samarajiwa, Y. Yuan, S. Graf, G. Ha, G. Haffari, A. Bashashati, R. Russell, S. McKinney, A. Langerod, A. Green, E. Provenzano, G. Wishart, S. Pinder, P. Watson, F. Markowetz, L. Murphy, I. Ellis, A. Purushotham, A.-L. Borresen-Dale, J. D. Brenton, S. Tavare, C. Caldas, S. Aparicio. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 2012, advance online publication.
Software download here.
CNV Frequency Files
1. Illumina 1M Duo platform
Conrad CNV Map
(Conrad et al. 2010; PMID: 19812545)
(International HapMap 3 Consortium et al, 2010; PMID: 20811451)
2. Affymetrix SNP6.0 platform
Conrad CNV Map (450 individuals)
HapMap3 CNV (1184 individuals)*
HapMap3 CNV (1258 individuals) **
(International HapMap Consortium et al, 2007; PMID: 17943122)
* – Frequency derived from published CNV results on 1184 individuals
** – Frequency derived from 1258 AffySNP6.0 CEL files normalized by CRMAv2 (Bengtsson et al. 2009; PMID: 19535535) and analyzed by CNA-HMMer (Shah et al. 2006; PMID: 16873504).
*** – Frequency derived from 270 AffySNP6.0 CEL files analyzed with CRMAv2 and CNA-HMMer.
Reference files (for normalization)
This reference was generated by pooling 270 individuals from the HapMap dataset (International HapMap Consortium et al, 2007). In contrast to the conventional median-based pooling of values across the samples, germline CNV signals are masked in this reference. This allows HMM-Dosage to profile the full set of germline events.
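The exact masking procedure is not described on this page. The following is only a minimal sketch of the idea, assuming that "masking" means excluding, at each probe, the samples known to carry a germline CNV there before taking the per-probe median. The function name `pooled_reference` and the boolean `cnv_mask` layout are hypothetical and are not part of the HMM-Dosage software.

```python
from statistics import median


def pooled_reference(values, cnv_mask):
    """Per-probe pooled reference with CNV-carrying samples excluded.

    values[s][p]   : raw intensity for sample s at probe p
    cnv_mask[s][p] : True where sample s has a known germline CNV at probe p
    (ASSUMPTION: this mask layout is illustrative, not the tool's format.)
    """
    n_probes = len(values[0])
    reference = []
    for p in range(n_probes):
        kept = [row[p] for row, mask in zip(values, cnv_mask) if not mask[p]]
        if not kept:
            # Every sample is masked here: fall back to the plain median.
            kept = [row[p] for row in values]
        reference.append(median(kept))
    return reference


# Toy data: 4 samples x 2 probes; samples 1 and 2 carry a deletion at probe 1.
values = [[2.0, 2.0], [2.0, 1.0], [2.5, 1.0], [1.5, 2.5]]
cnv_mask = [[False, False], [False, True], [False, True], [False, False]]

masked = pooled_reference(values, cnv_mask)
plain = [median(row[p] for row in values) for p in range(2)]
print(masked)  # masked reference
print(plain)   # conventional median-based pooling
```

In this toy data the plain median at the second probe is pulled toward the deletion carriers, while the masked reference stays close to the values of the non-carriers. That is why a masked reference leaves common germline events visible after normalization.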
- Download and extract HMMK11_0.0.01 into the desired folder
    cd <$install_dir>
    tar xvzf CnaSnp6_HMMK18.104.22.168.tar.gz
- Install MATLAB Component Runtime (MCR). The software was originally compiled on Linux, so the MCR installation must be for the 64-bit Linux (glnxa64) architecture and version 77. Unfortunately, MCRInstaller.bin is not included with the software.
You will need to specify the directory for this installation.
- Alternatively, if you have Matlab installed with the MCR compiler toolbox, you can compile the software to work on your machine’s architecture. Simply feed the script compileHMMK11.m into the Matlab executable command.
    cd <$install_dir>/HMMK11_0.1.0/bin/
    <$matlab_dir>/matlab -nodesktop < compileHMMK11.m
Running the compiled software
There are 2 ways to run the compiled software: 1) the executable or 2) the shell script. Both are produced by Matlab as a result of using the MCR compiler. If you already have MCR installed and added to your path (specifically the LD_LIBRARY_PATH environment variable), you can use the executable; otherwise, use the shell script, as it allows you to manually specify the MCR install path. In both cases, the same input/output files and parameters are required:
hmmK11LogR(infile,freqfile,paramSetFile,outfile,segoutfile,paramfile,chr)

infile
    Input file, 3-column tab-delimited: 1) chr 2) position 3) raw copy number (not in log2 scale), where 2 is the baseline neutral value. The program transforms the values into log2 scale as follows: log2(y/2). There are N rows, one for each probe.
freqfile
    Frequency file for the probe-level CNV prior. 9-column tab-delimited file: 1) id (can be arbitrary) 2) SNP array probe id (can be arbitrary) 3) chr 4) position 5) CNV homozygous deletion frequency 6) CNV hemizygous deletion frequency 7) CNV low-level amplification frequency 8) CNV med-level amplification frequency 9) CNV high-level amplification frequency
paramSetFile
    Parameter initialization file, a Matlab binary (.mat) file. This file contains the model and setting parameters necessary to run the program. See the example in the manual for details.
outfile
    Output file for probe-level results. 5-column tab-delimited file: 1) chr 2) start 3) stop 4) logR 5) HMM state prediction, where
        1  homozygous deletion
        2  hemizygous deletion
        3  neutral
        4  low-level amplification (gain)
        5  medium-level amplification
        6  high-level amplification
        7  homozygous deletion CNV
        8  hemizygous deletion CNV
        9  low-level amplification CNV
        10 medium-level amplification CNV
        11 high-level amplification CNV
    States 1, 2, 4, 5, 6 are somatic CNA events. States 7-11 are germline CNV events.
segoutfile
    Output file for segmentation results. 5-column tab-delimited file: 1) chr 2) start 3) stop 4) HMM state prediction (same as for outfile) 5) logR
paramfile
    Output file for the converged parameters after model training using the Expectation Maximization (EM) algorithm. Means and precisions for each HMM class/state are saved for each iteration of EM. This can be useful for comparing initial and converged values as well as for checking for label switching. The output file is a Matlab binary (.mat).
chr
    Integer denoting a chromosome. If chr is a value from 1 to 24, all raw copy number data for the N probes are normalized by the median of the probes in chr. If chr=0, the data are normalized using the median of all chromosomes. If chr=-1, the data are normalized using the default neutral value (e.g. 2).
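The interaction between the chr argument and the log2(y/2) transform can be illustrated with a short Python sketch. This is not the program's code. It assumes that "normalized by the median" means rescaling the data so that the chosen median maps to the neutral value 2, which makes the log ratio equal to log2(y/median). The function name `log_ratios` is hypothetical.

```python
import math
from statistics import median

NEUTRAL = 2.0  # baseline neutral copy number, as stated for infile


def log_ratios(probes, chr_code):
    """probes: list of (chrom, position, raw_copy_number) tuples.

    chr_code follows the documented convention:
      1..24 -> median of the probes on that chromosome
      0     -> median of all probes
      -1    -> fixed neutral value
    Raw copy numbers must be positive, otherwise log2 is undefined.
    """
    if chr_code == -1:
        reference = NEUTRAL
    elif chr_code == 0:
        reference = median(y for _, _, y in probes)
    elif 1 <= chr_code <= 24:
        on_chrom = [y for c, _, y in probes if c == chr_code]
        if not on_chrom:
            raise ValueError(f"no probes on chromosome {chr_code}")
        reference = median(on_chrom)
    else:
        raise ValueError("chr must be -1, 0, or in 1..24")
    # ASSUMPTION: rescale y so that `reference` maps to NEUTRAL, then apply
    # log2(y'/2); algebraically this is log2(y / reference).
    return [math.log2(y / reference) for _, _, y in probes]


probes = [(1, 100, 2.0), (1, 200, 4.0), (2, 300, 1.0)]
print(log_ratios(probes, -1))  # [0.0, 1.0, -1.0]
```

With chr=-1 the sketch reduces exactly to the documented log2(y/2). The other two modes differ only in which median replaces the constant 2.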
Here is an example of how to use the shell script. Refer to runTest_script.sh to see how each input file is used. Also, notice the formatting of each input and output file in
    cd <$install_dir>/HMMK11_0.1.0/test/
    ./runTest.sh <$mcr_dir>/V77/
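Once a run has finished, the segmentation output can be summarized using the state numbering listed for outfile and segoutfile. The Python sketch below is an illustration only. It assumes that segoutfile has no header row and follows the documented 5-column layout (chr, start, stop, state, logR). The helper names `event_class` and `summarize_segments` are hypothetical.

```python
import csv
from collections import Counter

# State numbering as documented for outfile/segoutfile.
STATE_LABELS = {
    1: "homozygous deletion",
    2: "hemizygous deletion",
    3: "neutral",
    4: "low-level amplification (gain)",
    5: "medium-level amplification",
    6: "high-level amplification",
    7: "homozygous deletion CNV",
    8: "hemizygous deletion CNV",
    9: "low-level amplification CNV",
    10: "medium-level amplification CNV",
    11: "high-level amplification CNV",
}


def event_class(state):
    """Map an HMM state number to neutral / somatic CNA / germline CNV."""
    if state == 3:
        return "neutral"
    if state in (1, 2, 4, 5, 6):
        return "somatic CNA"
    if 7 <= state <= 11:
        return "germline CNV"
    raise ValueError(f"unknown state {state}")


def summarize_segments(lines):
    """Count segments per event class.

    lines: iterable of tab-delimited rows (chr, start, stop, state, logR).
    ASSUMPTION: no header row is present.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if row:
            counts[event_class(int(row[3]))] += 1
    return counts


example_rows = [
    "1\t1000\t5000\t3\t0.01",
    "1\t5001\t9000\t2\t-0.45",
    "1\t9001\t12000\t9\t0.30",
]
print(dict(summarize_segments(example_rows)))
```

For a real run, pass an open file handle instead of `example_rows`, for example `summarize_segments(open(path, newline=""))`, where `path` is the segoutfile you supplied to the program.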
Running the software in Matlab
If you have Matlab installed and wish to run the software within the Matlab environment, start Matlab and add the source directories to the path before executing the main function.
    cd <$install_dir>/HMMK11_0.1.0/bin/
    <$matlab_dir>/matlab
    >> addpath(genpath('<$install_dir>/HMMK11_0.1.0/cnahmmer'))
    >> addpath(genpath('<$install_dir>/HMMK11_0.1.0/stats'))
    >> % assuming you have all the parameters to the main function assigned...
    >> hmmK11LogR(infile,freqfile,paramSetFile,outfile,segoutfile,paramfile,chr)