bart_posterior_probmutationSeq is software for somatic SNV detection using next generation sequencing (NGS) data from matched tumour/normal samples. It uses a feature-based classifier trained on validated somatic mutation samples while benefiting from other available information such as base quality, mapping quality, strand bias and tail distance. Given paired normal/tumour bam files, mutationSeq will output the probability of each candidate site being somatic.

Feel free to contact Jiarui Ding (jiaruid [at] cs [dot] ubc [dot] ca) if you have any questions regarding this software,


Jiarui Ding; Ali Bashashati; Andrew Roth; Arusha Oloumi; Kane Tse; Thomas Zeng; Gholamreza Haffari; Martin Hirst; Marco A. Marra; Anne Condon; Samuel Aparicio; Sohrab P. Shah. “Feature based classifiers for somatic mutation detection in tumour-normal paired sequencing data.” Bioinformatics 2011; doi: 10.1093/bioinformatics/btr629 [PDF]


Software download here.


This package is only supported on linux operating systems.
To install the package, simply extract the downloaded museq_4.3.8.tar.gz file into the desired directory. Using a terminal in linux

Download and extract into the desired folder <$install_dir>

  •  cd<$install_dir>
  •  tar -xvf

It might be required to recomplie the To do so, simply use make command and specify python and boost paths:

  • make PYTHON=/path/to/python BOOSTPATH=/path/to/boost


mutationSeq has the following dependencies:

  • python (tested for version 2.7.x) with the following packages installed:
    • numpy (tested for version 1.7.1 and highly recommended to link against BLAS)
    • scipy (tested for version 0.12.0)
    • scikits-learn (tested for version 0.14.1)
    • matplotlib (tested for version 1.2.1)
    • intervaltree (tested for version 2.0.4)
  • bamtools (tested for version 2.3.0 but modified slightly to meet our needs. It is provided in the package.)
  • boost (tested for version 1.51.0)
  • Install LAPACK package and make sure to export LD_LIBRARY_PATH to the location of the LAPACK/ATLAS libraries


To use the mutationSeq package, there are two python wrappers around the pybam engine:


Given a matched tumour/normal bam files, returns the probability that each candidate site is somatic. It employs the model generated by

The easiest way to run mutationSeq to analyze a whole genome or exon capture is to run with the default model_v4.1.2.npz provided in the package. It is also possible to build a custom model based on new training data by running first.


Using a terminal in linux:

cd <$mutationSeq> python normal:normal.bam tumour:tumour.bam reference:reference.fasta model:model.npz                           [–options]

 For the single sample version, simply use a single bam file instead of a pair and use option -s/–single:

python normal:normal.bam reference:reference.fasta model:model.npz –single [–options]


python tumour:tumour.bam reference:reference.fasta model:model.npz –single [–options]
Description should be run using a python command. It is recommended to use python 2.7 (since python 3’s new features have not yet been fully tested on this version of mutationSeq). A list of colon-delimited sample names should be specified as input arguments as following:

normal /path/to/normal_file.bam
tumour /path/to/tumour_file.bam
reference /path/to/reference.fasta
model: /path/to/model.npz (use the models provided in the package unless you train a new mode using

The following options are available:

-h –help

print help – optional

-a –all

force to print out the result for every position even if their predicted probabilities are less than the specified

threshold – optional

-b –buffer_size

specify max amount of memory usage – optional 

-c –config

specify /path/to/metadata.config that is used to add meta information to the output file as in the standard VCF4.1 – mandatory 


specify min depth (coverage) to be considered – optional

-d –deep

deepseq data analysis – optional

-e –export

specify /path/to/extracted_featuers.tsv to save exported feature vector – optional 

-f –positions_file

input a file containing a list of positions each of which is in a separate line – optional 

-i –interval

specify [chr]:[start]-[end] range – optional

-l –log_file

specify /path/to/file.log to write the log information – optional 


cancel out the heuristic conditions for somatic candidacy and force mutatioSeq to analyze all the input positions –


-n –normalized

use this option if using Nmodel.npz, otherwise it will extract regular features which match model.npz, deprecated – optional 


specify max variant percentage in the normal bam file – optional

-o –out

specify /path/to/output.vcf – mandatory

-p –purity

pass sample purity as an input – optional

-q –quality_threshold

filter out the reads with mapping quality less than –quality_threshold –optional 

-s –single

single sample analysis – optional

-t –threshold

filter out the predicted probabilities greater than or equal to –threshold – optional 


specify min number of variants in the tumour bam file – optional


only extract features – optional

-v –verbose

verbosity – optional


print version of the software – optional 

  •  When analyzing a whole genome/chromosome, using –no_filter option will make mutationSeq to stop nominating positions for somatic call prior to its analysis which in turn would result in a longer run time due to the larger number of positions that mutationSeq would analyze.


This is a typical example that would run on the specified tumour/normal pair bam files using the human_all.fasta reference. It would be run for chromosome 1 for positions in the range [123456, 345678] and the results are written as out.vcf in the specified directory.

Note that it would print only those positions whose predicted probabilities are greater than 0.3.

cd <$mutationSeq>
python normal:/share/…/DAH145N_A12970_3_lanes_dupsFlagged.bam
tumour:/share/…/DAH145_A12958_3_lanes_dupFlagged.bam reference:/share/…/reference/human_all.fasta
model:/share/…/mutationSeq/model_v4.1.2.npz –config /share/…/mutationSeq/metadata.config
–interval 1:123456-345678 –threshold 0.3 –out /share/…/out.vcf
  • The fasta and bam files used as input arguments need to be pre-indexed by samtools.
  • Use metadata.config provided in the package for –config option


The output of is formatted as standard VCF4.1. It consists of two parts: meta information and data lines. Below is a sample meta information from a output file:

##fileDate=$date of the file
##reference= $the reference used in the command line
##tumour=$the tumour file used in the command line
##normal=$the normal file used in the command line
##threshold= $ the value specified as threshold in the command line
##INFO=<ID=PR,Number=1,Type=Float,Description=”Probability of somatic mutation”>
##INFO=<ID=TR,Number=1,Type=String,Description=”Count of tumour with reference to REF”>
##INFO=<ID=TA,Number=1,Type=String,Description=”Count of tumour with reference to ALT”>
##INFO=<ID=NR,Number=1,Type=String,Description=”Count of normal with reference to REF”>
##INFO=<ID=NA,Number=1,Type=String,Description=”Count of normal with reference to ALT”>
##INFO=<ID=TC,Number=1,Type=String,Description=”Tri-nucleotide context”>
##INFO=<ID=NI,Number=1,Type=String,Description=”Number of insertions in the vicinity of a position”>
##INFO=<ID=ND,Number=1,Type=String,Description=”Number of deletions in the vicinity of a position”>
##FILTER=<ID=threshold,Description=”Threshold on probability of positive call”>

The meta information header is self explanatory. To read more about VCF4.1 please refer to VCF4.1 description. The data line section of the output looks like the following:

1    249    .    A    G    0.60    PASS    PR=0.129;TR=62;TA=4;NR=112;NA=0;TC=AAA;NI=0;ND=0
1    268    .    A    C    0.65    INDL    PR=0.139;TR=55;TA=5;NR=96;NA=0;TC=TAA;NI=3;ND=1
1    274    .    A    G    1.81    PASS    PR=0.341;TR=58;TA=3;NR=95;NA=0;TC=TAA;NI=0;ND=0
1    280    .    A    C    0.41    FAIL    PR=0.09;TR=52;TA=1;NR=88;NA=1;TC=TAA;NI=0;ND=0

where REF is the reference nucleotide at the position POS of chromosome CHROM. ALT is the nucleotide on the major allele in tumour if different from REF, otherwise it is the nucleotide on the minor allele. QUAL is Phred_quality score and FILTER is PASS if the probability of being somatic is greater than –threshold. ID would be eventually a unique id but not yet implemented.

INFO column contains the following information:

  • PR: predicted probability of being somatic mutation
  • TR: number of nucleotides in the tumour bam file in the position POS that match to the reference nucleotide reported by REF TA: similar to TR but matches to the alternative nucleotide reported by ALT
  • NR: similar to TR but for the normal bam file
  • NA: similar to NR but matches to the alternative nucleotide reported by ALT
  • TC: Tri-nucleotide context
  • NI: Number of insertions in the vicinity of a position
  • ND: Number of deletions in the vicinity of a position

The single sample mutationseq reports the genotype and the likelihoods for the genotypes in addition to the all the previous fields. Below is the sample meta information and data lines:

##fileDate=$date of the file
##reference= $the reference used in the command line
##tumour=$the tumour file used in the command line
##normal=$the normal file used in the command line
##threshold= $ the value specified as threshold in the command line
##INFO=<ID=PR,Number=1,Type=Float,Description=”Probability of somatic mutation”>
##INFO=<ID=TR,Number=1,Type=String,Description=”Count of tumour with reference to REF”>
##INFO=<ID=TA,Number=1,Type=String,Description=”Count of tumour with reference to ALT”>
##INFO=<ID=NR,Number=1,Type=String,Description=”Count of normal with reference to REF”>
##INFO=<ID=NA,Number=1,Type=String,Description=”Count of normal with reference to ALT”>
##INFO=<ID=TC,Number=1,Type=String,Description=”Tri-nucleotide context”>
##INFO=<ID=NI,Number=1,Type=String,Description=”Number of insertions in the vicinity of a position”>
##INFO=<ID=ND,Number=1,Type=String,Description=”Number of deletions in the vicinity of a position”>
##INFO=<ID=PL,Number=1,Type=String,Description=”Phred-scaled genotype likelihoods”>
##FILTER=<ID=threshold,Description=”Threshold on probability of positive call”>


#CHROM    POS    ID    REF    ALT    QUAL    INFO
1    249    .    A    G    0.60    PASS   PR=0.129;TR=62;TA=4;NR=112;NA=0;TC=AAA;NI=0;ND=0;GT=0/0;PL=0,9,255
1    268    .    A    C    0.65    INDL   PR=0.139;TR=55;TA=5;NR=96;NA=0;TC=TAA;NI=3;ND=1;GT=0/0;PL=0,9,255
1    274    .    A    G    1.81    PASS   PR=0.341;TR=58;TA=3;NR=95;NA=0;TC=TAA;NI=0;ND=0;GT=0/0;PL=0,12,255
1    280    .    A    C    0.41    FAIL   PR=0.09;TR=52;TA=1;NR=88;NA=1;TC=TAA;NI=0;ND=0;GT=0/1;PL=4,2,255 aims to extract features of the labeled training data and fit them to a model based on the random forest algorithm to avoid over fitting. It is also robust against training sets with unequal distributions of positive and negative samples and supports multi-class classification.

To train a new model run the following commands from linux terminal:

cd <$mutationSeq>
python posfile.pos [–options] aims to generate a new model using new training data. A position input file, e.g. posfile.pos, is required. This file consists of a list of space-delimited columns. The columns are: chromosome, position, and label. The file also contains a header which specifies the following information:

normal /path/to/normal_file.bam
tumour /path/to/tumour_file.bam
reference /path/to/reference.fasta

Figure 3 shows a sample position file.

# normal /path/to/normal_file.bam# tumour /path/to/tumour_file.bam

# reference /path/to/reference.fasta

10 89682885 SOMATIC

21 46676181 GERMLINE

17 19127560 WILDTYPE

 Figure 3. sample position input file for running

Following labels are accepted in the position file
  • SOMATIC: somatic mutation
  • WILDTYPE: wildtype
  • GERMLINE: germline
  • HET: heterozygous for single sample mode
  • HOM: homozygous for single sample mode
  • HET_ONE: the same as HET but for paired mode
  • HOM_ONE: the same as HOM but for paired mode
  • HOM_GERMLINE: homozygous germline
  • HET_GERMLINE: heterozygous germline
  • CLEAN: clean positions, i.e. no variant

To make consider any one of these labels as positive, simply pass the name of the label to the –labels option in the input. The rest of the labels will be considered as negative. For instance,

python –labels SOMATIC,GERMLINE [options]

makes SOMATIC and GERMLINE labels in the input position file be considered positive labels and the rest of the labels will be negative.

The following options are available:

-h –help

print usage help – optional 

-d –deep

deepseq training – optional 


specify the label in training file list to be considered positive – optional 

-l –log

specify /path/to/file.log to write log information – optional 

-m –model

specify /path/to/existing_model.npz, usually used for validation – optional 


specify if you want to train with normalized features, deprecated – optional 

-o –out

/path/to/output.vcf, default=None – mandatory 

-s –single

single sample training – optional 


activate the option of validating the same format file with known labels, metavar=’FILE’, nargs=’*’, default=None – optional 


verbosity – optional

–version print version of software  – optional


This is a typical example that would run on the specified posfile.pos. Likely, a training set will be a list of isolated positions and truth labels across different cases. extracts the features for these positions and fit them to a model. will accept a glob of position files (*.pos) that have chromosome, position, and somatic status. For each of these files it expects a header with the normal, tumour, and reference paths.

cd <$mutationSeq>
python /share/…/posfile.pos –out /share/…/sample_model.npz


sample_model.npz will be the main output of saved in the directory specified by –out option. This model can be later used for classification using The cross-validation/ validation results and ROC curves plots are also saved in the specified directory.