Filtered Spaced Word Matches

Filtered Spaced Words Matches (FSWM) is a fast alignment-free approach to estimates distances between genomic sequences - i.e. the number of substitutions per position between pairs of genomic sequences. For a set of input sequences, FSWM produces a distance matrix; on our server a phylogenetic tree is then calculated from these distances. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes only a few minutes. In addition to being fast, the estimated distances are very accurate which leads to phylogenetic trees of high quality.

FSWM works in 4 steps:

First, FSWM finds all 'spaced-word matches' between a pair of genomic sequences with respect to a pre-defined binary pattern of 'match' and 'don't-care' positions. A 'spaced-word match' is a gap-free alignment of two words from the input sequences, that has the same length as the pattern. Nucleotides at the 'match' positions must be identical, while mismatches are possible at the 'don't-care positions'.

For each pair of input sequences, our program estimates the average number of substitutions per position, based on the ratio of mismatches at the 'don't-care positions' of the spaced-word matches.

To distinguish between spaced-word matches representing true homologies and spurious background spaced-word matches, our program gives a score to each spaced word match, based on a nucleotide substitution matrix (Chiaromonte et al., 2002):

The score of a spaced-word match is then defined as the sum of the substitution scores of the aligned nucleotides at the don't care positions. In the above example, there are four don't-care positions, so four pairs of aligned nucleotides are to be considered, namely (C,C), (T,A), (A,C) and (A,G). With the above substitution matrix, the score of this spaced-word match is therefore 100-123-114-31 = -168.

Our program discards all spaced-word matches with a score below a certain threshold. By default, the threshold is zero, but the user can adjust this value.

For each pair of input sequences, the scores of the spaced-word matches can be visualized using a so-called 'spaced-word histogram'. Here, the number of spaced-word matches is plotted against the scores, i.e. for each possible score value the number of spaced-word matches with this score is plotted.

For a pair of i.i.d. random sequences, two peaks are visible in the spaced-word histograms which are both approximately binomially distributed (see above). The peak on the left-hand side (negative score values) essentially consists of random background hits, while the peak on the right-hand side (positive values) consists of spaced-word matches representing true homologies. Please note that for real data, the homologous peak looks more complex because the varying degrees of sequence similarity in different parts of the sequences, see below (Octadecabacter arcticus 238 vs Octadecabacter antarticus 307).

Spaced-word histograms can be visualized by our web server and the user can change the threshold for each pairwise sequence comparison individually, to distinguish true homologies from noise. Our command line tool, which can be downloaded in the download section, allows the user as well to change the threshold, but the difference compared to our web server is that only a global threshold can be set for all sequence pairs.

FSWM also removes ambigous spaced-word matches, see our paper for more details.

After the threshold is set and the background spaced-word matches are discarded, FSWM estimates for each pair of input sequences their evolutionary distance, i.e. the average number of substitutions per sequence position since the two sequences diverged from their last common ancestor. To this end, we calculate mismatch rate for the aligned nucleotide pairs at the don’t care positions of the remaining (homologous) spaced-word matches. We then use the Jukes-Cantor correction to estimate the number of substitutions per position. The output of the command line program is a distance matrix in phylip format, containing the estimated distances for all sequence pairs. On our web server the distance matrix can be downloaded, including the resulting tree, obtained with Neighbor-Joining.

The program source code for FSWM including a README file can be downloaded here: Github

The program takes a set of genomic sequences as input and generates a distance matrix with the pairwise distances between the input sequences.

Usage:

to compile type: make

run with: ./fswm [options] <sequences >

<sequence> format:

The input sequences must be contained in one single file FASTA format. Each species/genome must be represented by one single sequence in the input FASTA file. If you have multiple reads, contigs or chromosomes per input species, please concatenate them to one single sequence to make sure each species/genome corresponds to only one sequence. Example:

>Genome1
ATAGTAGATGAT..
>Genome2
ATAGTAGTAGTAG..
>Genome3
ATGATGATGATGATG..
..

etc.

options:

-h: print this help and exit
-k <integer>: pattern weight (default 12)
-t <integer>: numer of threads (default: 10)
-s <integer>: the minimum score of a spaced-word match to be considered homologous (default: 0)

The program that was used to generate artificial genome sequences can be downloaded here: Download

Usage

compile: make

run: ./seqGen [options]

options:

-h: print help and exit
-m <double>: substitution rate
-i <double>: indel rate
-f <file>: use genome in <file> as root sequence, if -f unspecified an artificial sequence is generated as root.
-l <integer>: specify length if no root sequence is provided, i.e. if -f <file> is unspecified"

example 1:

./seqGen -m 0.5 -i 0.0 -f genonme.fasta

creates a pair of sequences with 0.5 substitutions per position and no indels, 
starting from the provided ancestral genome in the file genome.fasta

example 2:

	
./seqGen -m 0.01 -i 0.5 -l 5000000

creates a pair of sequences with 0.01 substitutions and with a 0.5% indel probability per position, 
starting from a randomly generated sequence of length 5.000.000

Documentation

Only DNA sequences are allowed. Sequence must be in FASTA format. All genomes must be contained in one FASTA file. Example:

>Genome1
ATAGTAGATGAT..
ATTGTAGTAG..
>Genome2
ATAGTAGTAGTAG..
GTAGTA....
>Genome3
TGATGATGATGATG..
...

etc.

If some of your genomes contain multiple chromesomes or contigs, please concatenate them into one sequence. Sequences in the file will be pairwise compared.

The file size on our webserver is limited to 512mb. Further restrictions are: Minimum length for each sequence is 1000 bp, sequence count must be between 2 and 100.

The weight is by default 12. The smaller the weight is, the longer the progam takes and the larger it is, the shorter. This default value of weight works well with most prokaryotic genomes. If you work with smaller genomes, a smaller weight is recommended. However, if larger sequences are to be analysed, a higher weight will keep the run time within a resonable range.

Click this button to start the job. Please do not close the browser during file upload. Your job will start on the server and you will get a url, with which the results can be accessed later. The browser can be closed now but keep this url to access your results.

The results are presented in as a spaced-word histogram. Above the histogram you can switch between all pair-wise comprisons. The red area of the chart presents background matches and the green area homologous matches.

The threshold should separate the background and homologous peak, which are marked as red and green area in the histogram. By default it is set to 0. However, this default value might not work for all datasets. In this case, you can toogle the slide and adjust the threshold so that it looks proper to you. The distances will be updated accroding to the threshold you set and can be downloaded throught download button. Sometimes the peaks are not clearly visible, setting the y-Axis to log scale can be helpful.

Contact:

bmorgen@gwdg.de

Scientific publications using filtered spaced word matches should cite:

C.-A. Leimeister, S. Sohrabi-Jahromi, B. Morgenstern (2017)

Fast and Accurate Phylogeny Reconstruction using Filtered Spaced-Word Matches

Bioinfomatics 33, 971-979, DOI 10.1093/bioinformatics/btw776

Usage:

Usage

Upload Multi-FASTA file ^?

[Download example file] [Limits]

Set weight (number of match positions)

Documentation

Contact:

Scientific publications using filtered spaced word matches should cite:

Usage:

Usage

Upload Multi-FASTA file ?

[Download example file] [Limits]

Set weight (number of match positions)

Documentation

Contact:

Scientific publications using filtered spaced word matches should cite:

Upload Multi-FASTA file ^?