HMM-Dosage


HMM-Dosage (HMM for Detection of Somatic And Germline Events) is software that analyzes SNP genotyping data from tumours to predict both somatic and germline copy number changes.

Statistical approaches for distinguishing somatic CNAs from germline CNVs in SNP genotyping array data are underdeveloped. We address this issue by extending the generative probabilistic model of an HMM method, CNA-HMMer [Shah et al., 2006], to detect and discriminate between patient-specific CNVs and somatic CNAs. The model segments the log ratio intensity data and, for each segment, predicts a discrete copy number status from a set of 5 somatic states (homozygous deletion, hemizygous deletion, gain, amplification, and high-level amplification), 5 analogous germline states, and neutral copy number.

HMM-Dosage can predict germline events even in the absence of normal samples by probabilistically incorporating prior CNV information derived from any source. This additional prior enables the HMM to capture the distributional properties of CNVs when fitting its model parameters, so that a dichotomous (somatic versus germline) output is obtained from a single, unified probabilistic framework.

Please feel free to contact Gavin Ha (gha [at] bccrc [dot] ca) if you have any questions regarding this software.

Publications

Ha, G. & Shah, S. P. Distinguishing Somatic and Germline Copy Number Events in Cancer Patient DNA Hybridized to Whole-Genome SNP Genotyping Arrays, chap. 22. Methods in Molecular Biology (Springer Science and Business Media, LLC, 2013).  * Use this publication for citing the hmmDosage software

C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. Speed, A. G. Lynch, S. Samarajiwa, Y. Yuan, S. Graf, G. Ha, G. Haffari, A. Bashashati, R. Russell, S. McKinney, A. Langerod, A. Green, E. Provenzano, G. Wishart, S. Pinder, P. Watson, F. Markowetz, L. Murphy, I. Ellis, A. Purushotham, A.-L. Borresen-Dale, J. D. Brenton, S. Tavare, C. Caldas, S. Aparicio. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 2012, advance online publication.

Download Software

Software download here.

MCRInstaller.bin

CNV Frequency Files

1. Illumina 1M Duo platform
Conrad CNV Map
(Conrad et al. 2010; PMID: 19812545)
HapMap3 CNV
(International HapMap 3 Consortium et al, 2010; PMID: 20811451)

2. Affymetrix SNP6.0 platform
Conrad CNV Map (450 individuals)
HapMap3 CNV (1184 individuals)*
HapMap3 CNV (1258 individuals) **
HapMap270***
(International HapMap Consortium et al, 2007; PMID: 17943122)

* – Frequency derived from published CNV results on 1184 individuals
** – Frequency derived from 1258 AffySNP6.0 CEL files normalized by CRMAv2 (Bengtsson et al. 2009; PMID: 19535535) and analyzed by CNA-HMMer (Shah et al. 2006; PMID: 16873504).
*** – Frequency derived from 270 AffySNP6.0 CEL files analyzed with CRMAv2 and CNA-HMMer.

 

Reference files (for normalization)

HapMap270_maskedRefMedian.txt
This reference was generated by pooling 270 individuals from the HapMap dataset (International HapMap Consortium et al, 2007). In contrast to the conventional median-based pooling of values across the samples, germline CNV signals are masked in this reference. This allows HMM-Dosage to profile the full set of germline events.

 

Installation Instructions

  1. Download and extract HMMK11_0.1.0 into the desired folder <$install_dir>
    cd <$install_dir>
    tar xvzf CnaSnp6_HMMK11.0.1.0.tar.gz
  2. Install MATLAB Component Runtime (MCR). Note that the compilation was originally done in Linux, hence the MCR installation must be for 64-bit Linux (glnxa64) architecture and version 77. Unfortunately, MCRInstaller.bin is not included in the software.
    You will need to specify the directory for this installation <$mcr_dir>
  3. Alternatively, if you have MATLAB installed with the MATLAB Compiler toolbox, you can compile the software for your machine's architecture. Simply pipe the script compileHMMK11.m into the MATLAB executable:
    
    cd <$install_dir>/HMMK11_0.1.0/bin/
    <$matlab_dir>/matlab -nodesktop < compileHMMK11.m
    

Running the compiled software

There are two ways to run the compiled software: 1) the executable or 2) the shell script. Both are produced by the MATLAB Compiler. If the MCR is already installed and its libraries are on your path (specifically the LD_LIBRARY_PATH environment variable), you can use the executable directly; otherwise, use the shell script, which lets you specify the MCR install path manually. In both cases, the same input/output files and parameters are required:


hmmK11LogR(infile,freqfile,paramSetFile,outfile,segoutfile,paramfile,chr)

infile         Input file with format in 3-columns tab-delimited:
                 1) chr 2) position 3) raw copy number (not in log2 scale),
                 where 2 is baseline neutral. The program will transform
                 the values into log2 scale as follows: log2(y/2).
                 There are N rows, one for each probe.

freqfile       Frequency file for probe-level CNV prior.
                 9-column tab-delimited file:
                 1) id (can be arbitrary)
                 2) SNP array probe id (can be arbitrary)
                 3) chr
                 4) position
                 5) CNV homozygous deletion frequency
                 6) CNV hemizygous deletion frequency
                 7) CNV low-level amplification frequency
                 8) CNV medium-level amplification frequency
                 9) CNV high-level amplification frequency

paramSetFile   Parameter initialization file; a MATLAB binary (.mat) file.
                 This file contains the model and setting parameters necessary
                 to run the program.  See example in manual for details.

outfile        Output file for probe-level results.
                 5-column tab-delimited file:
                 1) chr 2) start 3) stop 4) logR
                 5) HMM state prediction where
                  1 homozygous deletion, 2 hemizygous deletion, 3 neutral,
                  4 low-level amplification (gain),
                  5 medium-level amplification,
                  6 high-level amplification,
                  7 homozygous deletion CNV,
                  8 hemizygous deletion CNV,
                  9 low-level amplification CNV,
                  10 medium-level amplification CNV,
                  11 high-level amplification CNV.
                  States 1,2,4,5,6 are somatic CNA events.
                  States 7-11 are germline CNV events.

segoutfile     Output file for segmentation results.
                 5-column tab-delimited file:
                 1) chr 2) start 3) stop
                 4) HMM state prediction (same as for outfile) 5) logR

paramfile      Output file for converged parameters after model training
                 using Expectation Maximization (EM) algorithm.
                 Means and precisions for each HMM class/state are saved for
                 each iteration of EM.  This can be useful for comparing initial
                 and converged values as well as checking for label switching.
                 The output file is a Matlab binary (.mat).

chr            Integer denoting a chromosome.
                 If chr is a value between [1 to 24], then all raw
                 copy number data for N probes will be normalized by
                 the median of the probes in chr.  If chr=0, then
                 data is normalized using median of all chromosomes.
                 If chr=-1, data is normalized using the default
                 neutral value (e.g. 2)
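
Because the program expects strict tab-delimited layouts, it can help to sanity-check an input file before a run. The following is a minimal sketch using only awk and sort, not part of HMM-Dosage itself: the file name, the probe values, and the helper names (check_columns, to_logr, chr_median) are all made up for illustration. It checks the column count, previews the log2(y/2) transform described for infile, and reports the per-chromosome median of the raw values that a chr setting between 1 and 24 refers to. How the program applies that median internally is not reproduced here.

```shell
# Made-up 3-column infile: chr, position, raw copy number (2 = neutral).
printf '1\t1000\t2\n1\t2000\t1\n1\t3000\t4\n2\t1500\t3\n' > example_infile.txt

# Report rows whose tab-delimited column count differs from the expected
# number (3 for infile, 9 for freqfile); exit status is non-zero if any do.
check_columns() {
  awk -F'\t' -v n="$2" '
    NF != n { printf "line %d: %d columns, expected %d\n", NR, NF, n; bad = 1 }
    END     { exit (bad ? 1 : 0) }' "$1"
}

# Preview log2(y/2) for column 3. awk's log() is the natural logarithm,
# so dividing by log(2) converts it to base 2.
to_logr() {
  awk 'BEGIN { FS = OFS = "\t" }
       { printf "%s\t%s\t%.3f\n", $1, $2, log($3 / 2) / log(2) }' "$1"
}

# Median raw copy number (column 3) of the probes on one chromosome.
chr_median() {
  awk -F'\t' -v c="$2" '$1 == c { print $3 }' "$1" | sort -n |
    awk '{ v[NR] = $1 }
         END {
           if (NR == 0) exit
           if (NR % 2) { print v[(NR + 1) / 2] }
           else        { print (v[NR / 2] + v[NR / 2 + 1]) / 2 }
         }'
}

check_columns example_infile.txt 3 && echo "infile layout OK"
to_logr example_infile.txt
chr_median example_infile.txt 1
```

The same check_columns helper can be pointed at a frequency file with 9 as the expected column count. Note that a raw copy number of 0 has no finite log2 value, so this preview is only meaningful for positive values.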

An Example

Here is an example of how to use the shell script. Refer to runTest_script.sh to see how each input file is used, and note the format of each input and output file in <$install_dir>/HMMK11_0.1.0/test/


cd <$install_dir>/HMMK11_0.1.0/test/
./runTest.sh <$mcr_dir>/V77/
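
Once a run has finished, the state numbers in the probe-level output can be mapped back to event classes. The sketch below assumes the 5-column outfile layout described above (chr, start, stop, logR, state); the file name and rows are invented for illustration, and label_states is a hypothetical helper rather than something shipped with the software.

```shell
# Invented probe-level output: chr, start, stop, logR, HMM state.
printf '1\t1000\t1000\t-0.95\t2\n1\t2000\t2000\t0.02\t3\n1\t3000\t3000\t0.55\t9\n' > example_outfile.txt

# Append the event class implied by the state number in column 5:
# 3 is neutral, 1/2/4/5/6 are somatic CNAs, 7-11 are germline CNVs.
label_states() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         if ($5 == 3)                   cls = "neutral"
         else if ($5 >= 1 && $5 <= 6)   cls = "somatic_CNA"
         else if ($5 >= 7 && $5 <= 11)  cls = "germline_CNV"
         else                           cls = "unknown"
         print $0, cls
       }' "$1"
}

label_states example_outfile.txt
```

Since the segment-level file places the state in column 4 rather than column 5, the column index would need to be changed before applying the same idea to segoutfile.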

Running the software in Matlab

If you have MATLAB installed and wish to run the software within the MATLAB environment, start MATLAB and add the source directories to the path before executing the main function.


cd <$install_dir>/HMMK11_0.1.0/bin/
<$matlab_dir>/matlab
>> addpath(genpath('<$install_dir>/HMMK11_0.1.0/cnahmmer'))
>> addpath(genpath('<$install_dir>/HMMK11_0.1.0/stats'))
>> % assuming you have all the parameters to the main function assigned...
>> hmmK11LogR(infile,freqfile,paramSetFile,outfile,segoutfile,paramfile,chr)