Bioinformatics
DNA and protein sequences may be represented as strings over a finite alphabet (in the case of DNA, the alphabet consists of A, C, G and T). Such sequences may also be regarded as symbolic signals, or symbolic vectors. Each signal sample (or each vector element) is a symbol, rather than a number. There is no order relation defined on the symbols, and no algebraic structure whatsoever. Such signals also arise in statistics (where they are sometimes called categorical data).
These papers are concerned with DNA sequence analysis, absent words or nullomers (strings that do not occur in the DNA), DNA data compression and the spectrum of symbolic signals. Regarding this last point, note that symbolic signals may exhibit periodicities, just like any other signal. For example, ACTACTACT... has period three. Is it possible to do Fourier analysis on symbolic signals? Fourier analysis is very useful in finite fields and in groups (commutative or noncommutative), but symbolic data lack such algebraic structure. Symbolic-to-numeric mapping is possible but yields results that depend on the mapping used. The methods discussed in a couple of these papers circumvent this difficulty.
Describes a completely alignment-free computational
method, based on a blind unsupervised approach, to detect
large-scale and small-scale genomic rearrangements between
pairs of DNA sequences. To illustrate the power and usefulness
of the method, complete human-chimpanzee and human-orangutan
chromosomal information maps are given. The software used has
been made publicly available and is described in detail.
Motivation: Ebola virus causes high-mortality hemorrhagic
fevers, with more than 25 000 cases and 10 000 deaths in
the current outbreak. Only experimental therapies are
available; thus, novel diagnostic tools and druggable targets
are needed.
Results: Analysis of Ebola virus genomes from
the current outbreak reveals the presence of short DNA
sequences that appear nowhere in the human genome. We identify
the shortest such sequences with lengths between 12 and 14.
Only three absent sequences of length 12 exist and they
consistently appear at the same location on two of the Ebola
virus proteins, in all Ebola virus genomes, but nowhere in the
human genome. The alignment-free method used is able to
identify pathogen-specific signatures for quick and precise
action against infectious agents, of which the current Ebola
virus outbreak provides a compelling example.
Availability
and Implementation: EAGLE is freely available for
non-commercial purposes at
http://bioinformatics.ua.pt/software/eagle.
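As an illustration of the underlying idea (and not of the EAGLE tool itself, whose method and data structures differ), the following sketch finds the words of a fixed length that occur in a pathogen sequence but nowhere in a host sequence. The sequences and the word length are placeholders, and a genome-scale run would need a far more memory-efficient representation than a Python set.

def kmers(seq, k):
    # Yield every overlapping word of length k.
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def relative_absent_words(pathogen, host, k):
    # Words of length k present in the pathogen but absent from the host.
    host_words = set(kmers(host, k))
    return {w for w in kmers(pathogen, k) if w not in host_words}

# Toy example with a short word length; the paper works with lengths 12 to 14.
print(relative_absent_words("ACGTACGTTAGGC", "ACGTGGCCATTTA", k=5))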
Similarity measures based on compression assess the
distance between two objects based on the number of bits needed
to describe one, given a description of the other.
Theoretically, compression-based similarity depends on the
concept of Kolmogorov complexity, which is non-computable. The
implementations require compression algorithms that are
approximately normal. The approach has important advantages (no
signal features to identify and extract, for example) but the
compression method must be normal. This paper proposes normal
algorithms based on mixtures of finite context models.
Normality is attained by combining two new ideas: the use of
least-recently-used caching in the context models, to allow
deeper contexts, and data interleaving, to better exploit that
cache. Examples for DNA sequences are given (at the human
genome scale).
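To make the compression-based similarity concrete, here is a minimal sketch of the normalized compression distance, using the general-purpose bz2 compressor as a stand-in for the paper's mixture-of-finite-context-models compressor (bz2 is not normal at genome scale, so the example is only illustrative):

import bz2

def c(data):
    # Compressed size, in bytes, under the stand-in compressor.
    return len(bz2.compress(data))

def ncd(x, y):
    # Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"ACGT" * 1000, b"ACGT" * 1000))    # close to 0: identical content
print(ncd(b"ACGT" * 1000, b"TTGACCA" * 571))  # larger: unrelated content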
Data summarization and triage is one of the current top
challenges in visual analytics. Ideally, one would like to let
users visually inspect large data sets and examine or request
data with particular characteristics. The need for
summarization and visual analytics is also felt when dealing
with digital representations of DNA sequences. Genomic data
sets are growing rapidly, making their analysis increasingly
more difficult, and raising the need for new, scalable tools.
For example, being able to look at very large DNA sequences
while immediately identifying potentially interesting regions
would provide the biologist with a flexible exploratory and
analytical tool. In this paper we present a new concept, the
"information profile", which provides a quantitative measure of
the local complexity of a DNA sequence, independently of the
direction of processing. The computation of the information
profiles is computationally tractable: we show that it can be
done in time proportional to the length of the sequence. We
also describe a tool to compute the information profiles of a
given DNA sequence, and use the genome of the fission yeast
Schizosaccharomyces pombe strain 972 h- and five human
chromosomes 22 for illustration. We show that information
profiles are useful for detecting large-scale genomic
regularities by visual inspection. Several discovery strategies
are possible, including the standalone analysis of single
sequences, the comparative analysis of sequences from
individuals from the same species, and the comparative analysis
of sequences from different organisms. The comparison scale can
be varied, allowing users to zoom in on specific details,
or obtain a broad overview of a long segment. Software
applications have been made
available for non-commercial use.
Genome assemblies are typically compared with respect to
their contiguity, coverage and accuracy. This paper proposes a
genome-wide, alignment-free distance based on compressed
maximal exact matches, that is, perfect repeats, without gaps
or misspellings, which cannot be further left- or
right-extended without loss of similarity. The genomic distance
proposed is based on the normalized compression distance, an
information-theoretic measure of the relative compressibility
of two sequences, as well as the nesting structure underlying
the assembly of larger maximal exact matches from smaller ones.
Four human genome assemblies are used for illustration.
Finally, we discuss the impact of genome sequencing and
assembly on the final content of maximal exact matches and on
the proposed genomic distance.
Previous studies have suggested that Chargaff's second
rule may hold for relatively long words (above 10 nucleotides),
but this has not been conclusively shown. In particular, the
following questions remain open: Is the phenomenon of symmetry
statistically significant? If so, what is the word length above
which significance is lost? Can deviations in symmetry due to
the finite size of the data be identified? This work addresses
these questions by studying word symmetries in the human
genome, chromosomes and transcriptome. To rule out
finite-length effects, the results are compared with those
obtained from random control sequences built to satisfy
Chargaff's second parity rule. We use several techniques to
evaluate the phenomenon of symmetry, including Pearson's
correlation coefficient, total variational distance, a novel
word symmetry distance, as well as traditional and equivalence
statistical tests. We conclude that word symmetries are
statistically significant in the human genome for word lengths up
to 6 nucleotides. For longer words, we present evidence that
the phenomenon may not be as prevalent as previously thought.
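The following sketch captures the basic quantity involved: for a given word length, it correlates the count of each word with the count of its reverse complement. It is a simplified version of one of the measures used; the paper also uses total variational distance, a dedicated word symmetry distance, and traditional and equivalence tests.

from collections import Counter
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(word):
    # Reverse complement of a DNA word.
    return word.translate(COMPLEMENT)[::-1]

def word_symmetry_correlation(seq, k):
    # Pearson correlation between each k-word count and the count of its reverse complement.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = list(counts)
    xs = [counts[w] for w in words]
    ys = [counts.get(revcomp(w), 0) for w in words]
    n = len(words)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")   # degenerate case: all counts equal
    return cov / (sx * sy)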
A finite-context (Markov) model of order k yields the
probability distribution of the next symbol in a sequence of
symbols, given the recent past up to depth k. Markov modeling
has long been applied to DNA sequences, for example to find
gene-coding regions. With the first studies came the discovery
that DNA sequences are non-stationary: distinct regions require
distinct model orders. Since then, Markov and hidden Markov
models have been extensively used to describe the gene
structure of prokaryotes and eukaryotes. This paper is the
first comprehensive study of the potential of Markov models
to describe complete genomes. Our approach relies on (i)
multiple competing Markov models of different orders; (ii)
careful programming techniques that allow orders as large as
sixteen; (iii) adequate handling of inverted repeats; and (iv)
probability estimates suited to the wide range of context
depths used. To measure how well a model fits the data at a
particular position in the sequence, we use the negative
logarithm of the probability estimate at that position. The
measure yields information profiles of the sequence, which are
of independent interest. The average over the entire sequence,
which amounts to the average number of bits per base needed to
describe the sequence, is used as a global performance measure.
Our main conclusion is that, from the probabilistic or
information theoretic point of view and according to this
performance measure, multiple competing Markov models explain
entire genomes almost as well as, or even better than,
state-of-the-art DNA compression methods, such as XM, which
rely on very different statistical models.
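As a rough illustration of the core measure, the sketch below uses a single adaptively updated order-k finite-context model with additive smoothing and records -log2 of the probability estimate at each position (an information profile), together with its average in bits per base. The actual study combines multiple competing models, handles inverted repeats and uses more careful probability estimates.

from collections import defaultdict
from math import log2

ALPHABET = "ACGT"

def information_profile(seq, k, alpha=1.0):
    # Per-position -log2 probability under an adaptively updated order-k model.
    counts = defaultdict(lambda: defaultdict(int))   # context -> symbol -> count
    profile = []
    for i in range(len(seq)):
        ctx = seq[max(0, i - k):i]                   # first positions use shorter contexts
        ctx_counts = counts[ctx]
        total = sum(ctx_counts.values())
        # Probability estimate with additive smoothing (alpha), suited to sparse contexts.
        p = (ctx_counts[seq[i]] + alpha) / (total + alpha * len(ALPHABET))
        profile.append(-log2(p))
        ctx_counts[seq[i]] += 1                      # update the model after coding the symbol
    return profile

seq = "ACGACGACGACGTTT"
prof = information_profile(seq, k=2)
print("average bits per base:", sum(prof) / len(prof))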
DNA may be represented by sequences of four symbols, but
it is often useful to convert those symbols into real or
complex numbers for further analysis. Several mapping schemes
have been used in the past, but most of them seem to be
unrelated to any intrinsic characteristic of DNA. The objective
of this work was to study a mapping scheme that is directly
related to DNA characteristics, and that could be useful in
discriminating between different species. Recently, we have
proposed a methodology based on the inter-nucleotide distance,
which proved to contribute to the discrimination among species.
In this paper, we introduce a new distance, the distance to the
nearest dissimilar nucleotide, which is the distance from a
nucleotide to the first occurrence of a different nucleotide. This
distance is related to the repetition structure of single
nucleotides. Using the information resulting from the
concatenation of the distance to the nearest dissimilar and the
inter-nucleotide distance, we found that this new distance
brings additional discriminative capabilities. This suggests
that the distance to the nearest dissimilar nucleotide might
contribute with useful information about the evolution of the
species.
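A small sketch of the two distances just mentioned (with edge handling simplified relative to the paper): the inter-nucleotide distance measures how far each nucleotide is from the next occurrence of the same symbol, while the distance to the nearest dissimilar nucleotide measures how far it is from the first following symbol that differs.

def inter_nucleotide_distances(seq):
    # Distance from each position to the next occurrence of the same nucleotide
    # (positions whose nucleotide does not occur again are skipped here).
    last_seen = {}
    dists = [None] * len(seq)
    for i, s in enumerate(seq):
        if s in last_seen:
            dists[last_seen[s]] = i - last_seen[s]
        last_seen[s] = i
    return [d for d in dists if d is not None]

def nearest_dissimilar_distances(seq):
    # Distance from each position to the first following nucleotide that differs.
    dists = []
    for i, s in enumerate(seq):
        j = i + 1
        while j < len(seq) and seq[j] == s:
            j += 1
        if j < len(seq):
            dists.append(j - i)
    return dists

print(inter_nucleotide_distances("AACGTA"))
print(nearest_dissimilar_distances("AAACCG"))   # runs of equal symbols give larger distances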
Minimal absent words have been computed in genomes of
organisms from all domains of life. Here, we explore different
sets of minimal absent words in the genomes of 22 organisms
(one archaeon, thirteen bacteria and eight eukaryotes). We
investigate whether the mutational biases that may explain the
deficit of the shortest absent words in vertebrates are also
pervasive in other absent words, namely minimal absent
words, and in other organisms. We find that the
compositional biases observed for the shortest absent words in
vertebrates are not uniform throughout different sets of
minimal absent words. We further investigate the hypothesis of
the inheritance of minimal absent words through common ancestry
from the similarity in dinucleotide relative abundances of
different sets of minimal absent words, and find that this
inheritance may be exclusive to vertebrates.
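For reference, the dinucleotide relative abundance used in that comparison is the standard quantity rho(XY) = f(XY) / (f(X) f(Y)). A minimal sketch of its computation from a single sequence (ignoring the complementary strand, which some definitions include):

from collections import Counter

def dinucleotide_relative_abundance(seq):
    # rho(XY) = f(XY) / (f(X) * f(Y)), with frequencies estimated from the sequence.
    n = len(seq)
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    f_mono = {b: c / n for b, c in mono.items()}
    f_di = {d: c / (n - 1) for d, c in di.items()}
    return {d: f_di[d] / (f_mono[d[0]] * f_mono[d[1]]) for d in f_di}

print(dinucleotide_relative_abundance("ACGTACGTAACCGGTT"))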
The three-base periodicity usually found in exons has been
used for several purposes, as for example the prediction of
potential genes. In this paper, we use a data model, previously
proposed for encoding protein-coding regions of DNA sequences,
to build signatures capable of supporting the construction of
meaningful dendrograms. The model relies on the three-base
periodicity and provides an estimate of the entropy associated
with each of the three bases of the codons. We observe that the
three entropy values vary among themselves and also from
species to species. Moreover, we provide evidence that this
makes it possible to associate a three-state entropy vector
with each species, and we show that similar species are
characterized by similar three-state entropy vectors.
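The sketch below conveys the idea of the three-state entropy vector using plain order-0 distributions at each codon position; the model in the paper attaches a finite-context model to each of the three states, so the numbers it produces are not directly comparable.

from collections import Counter
from math import log2

def codon_position_entropies(coding_seq):
    # Entropy (in bits) of the nucleotide distribution at each of the three codon positions.
    entropies = []
    for phase in range(3):
        counts = Counter(coding_seq[phase::3])
        total = sum(counts.values())
        h = -sum((c / total) * log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

print(codon_position_entropies("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"))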
Presents a method to process DNA sequences, based on
inter-nucleotide distances, that yields genomic signatures
for complete genomes able to discriminate between different
species. Using these signatures and hierarchical clustering it
is possible to build phylogenetic trees. Phylogenetic trees
lead to genome differentiation and allow the inference of
phylogenetic relations. The phylogenetic trees obtained suggest
that the inter-nucleotide distances capture the essential
information about the genomes. To create the genomic signature,
we construct a vector which describes the inter-nucleotide
distance distribution of a complete genome and compare it with
the reference distance distribution, which is the distribution
of a sequence where the nucleotides are placed randomly and
independently. It is the residual or relative error between the
data and the reference distribution that is used to compare the
DNA sequences of different organisms.
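A simplified sketch of the signature construction: compare the empirical inter-nucleotide distance distribution with the geometric reference distribution of a random, independent sequence and keep the relative error. Here all four nucleotides are pooled and assumed equiprobable, which the paper does not require.

from collections import Counter

def distance_distribution(seq, max_d=50):
    # Empirical distribution of distances between consecutive occurrences of the same nucleotide.
    last = {}
    dist_counts = Counter()
    for i, s in enumerate(seq):
        if s in last:
            dist_counts[i - last[s]] += 1
        last[s] = i
    total = sum(dist_counts.values())
    return [dist_counts[d] / total for d in range(1, max_d + 1)]

def reference_distribution(p, max_d=50):
    # Geometric reference: distances in a random i.i.d. sequence where each nucleotide has frequency p.
    return [p * (1 - p) ** (d - 1) for d in range(1, max_d + 1)]

def signature(seq, max_d=50):
    # Relative error between the empirical and reference distributions, used as the genomic signature.
    emp = distance_distribution(seq, max_d)
    ref = reference_distribution(0.25, max_d)   # 0.25 assumes equiprobable nucleotides
    return [(e - r) / r for e, r in zip(emp, ref)]

print(signature("ACGT" * 500)[:5])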
Highly accessed according to the publisher. Open
access article.
Shows how absent words relate to the repetitions and structure of the DNA data and defines the class of minimal absent words. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. The paper gives an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. The algorithm is publicly available. The class of minimal absent words is larger than that of shortest absent words yet still manageable, since it grows at most linearly with the string size (unlike the class of generic absent words, which grows exponentially).
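To make the definition concrete, the brute-force sketch below enumerates all candidate words up to a small length and keeps those that are absent while both the word without its leftmost character and the word without its rightmost character are present. It is exponential in the word length and serves only to illustrate the definition, not to replace the approximately linear-time algorithm of the paper.

from itertools import product

def substrings_of_length(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    # A word w is a minimal absent word if w is absent but w[1:] and w[:-1] both occur in seq.
    maws = []
    for k in range(2, max_len + 1):
        present_k = substrings_of_length(seq, k)
        present_km1 = substrings_of_length(seq, k - 1)
        for w in map("".join, product(alphabet, repeat=k)):
            if w not in present_k and w[1:] in present_km1 and w[:-1] in present_km1:
                maws.append(w)
    return maws

print(minimal_absent_words("ACGTACGGT", max_len=3))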
Several specific DNA lossless compression methods have
been proposed. Most of them explore the presence of exact or
approximate repeats and use low-order finite-context models as
a secondary or fall-back mechanism. This paper shows that
finite-context models can be used as primary DNA encoding
methods. The coding method is based on two finite-context
models that compete for the encoding of the data on a
block-by-block basis.
The protein-coding regions of DNA usually exhibit a
three-base periodicity. This paper explores this fact. It
introduces a DNA model based on three deterministic states,
where each state implements a finite-context model. The
experimental results obtained confirm the appropriateness of
the proposed approach, showing compression gains in relation to
the single finite-context model counterpart. Additionally, and
potentially more interesting than the compression gain on its
own, is the observation that the entropy associated with each of
the three base positions of a codon differs and that this
variation is not the same among the organisms analyzed.
This paper explores the three-base periodicity usually found in the protein-coding regions of DNA. It describes a DNA model based on three deterministic states, where each state implements a finite-context model, and discusses the experimental results obtained, which show compression gains in relation to the single finite-context model counterpart. A more complete account can be found in A Three-State Model for DNA Protein-Coding Regions.
The paper discusses methods for the Fourier analysis of
symbolic data, such as DNA sequences. The indicator sequence
approach is shown to be equivalent to the method based on the
autocorrelation function, and also to the spectral envelope
method.
One of the purposes of the paper is to present an approach to the analysis of symbolic data based on the autocorrelation concept. The autocorrelation of a symbolic sequence can be defined in a very natural way and leads to a numerical sequence. Then, we show that the Fourier transform of this sequence can be expressed in terms of the binary indicator sequences for each of the alphabet symbols. This approach sheds some light on the interplay between several of the methods that have been proposed for the Fourier analysis of symbolic DNA data.
Efficient computation procedures are also discussed, and it is shown that spectral analysis can be more efficient when obtained directly from the Fourier transforms of the indicator sequences.
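In practice, the spectrum can be obtained as follows (a minimal sketch of the indicator-sequence route, with the common convention of summing the squared DFT magnitudes of the four binary indicator sequences):

import numpy as np

def symbolic_spectrum(seq, alphabet="ACGT"):
    # Power spectrum of a symbolic sequence from the DFTs of its binary indicator
    # sequences (one per symbol), summing the squared magnitudes.
    spectrum = np.zeros(len(seq))
    for symbol in alphabet:
        indicator = np.array([1.0 if s == symbol else 0.0 for s in seq])
        spectrum += np.abs(np.fft.fft(indicator)) ** 2
    return spectrum

seq = "ACT" * 100                                    # period-three sequence of length 300
spec = symbolic_spectrum(seq)
print(int(np.argmax(spec[1:len(seq)//2 + 1])) + 1)   # peak at N/3 = 100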
Among other things, this paper describes a fast algorithm
for computing the Fourier coefficient of a symbolic sequence at
N/3 (corresponding to period three). In DNA, the
magnitude of this coefficient is of interest in gene detection.
The method proposed in the paper requires only trivial discrete
Fourier transforms (DFTs) of length three.
The first step of the method is to obtain the symbol counts for each nucleotide, that is, the number of nucleotides (A, C, G or T) in positions 3n, 3n+1, and 3n+2. A simple computation (equivalent to a DFT of length three) acts on these symbol counts and yields the magnitude of the Fourier coefficient at N/3. These symbol counts are related to the so-called block-sum transformation, in this case of length three (see also A note on the block sum transformation in the section DFT).
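A sketch of that computation, assuming the sequence length is a multiple of three: for each symbol, the three counts are transformed by a length-3 DFT, and the coefficient at frequency 1 equals that symbol's indicator-sequence DFT coefficient at N/3; summing the squared magnitudes over the four symbols reproduces the N/3 value of the full spectrum.

import numpy as np

def coefficient_at_N_over_3(seq, alphabet="ACGT"):
    # Summed squared magnitude, over the symbols, of the Fourier coefficient at N/3,
    # computed only from the symbol counts at positions 3n, 3n+1 and 3n+2.
    total = 0.0
    for symbol in alphabet:
        counts = [sum(1 for i in range(r, len(seq), 3) if seq[i] == symbol) for r in range(3)]
        coeff = np.fft.fft(counts)[1]   # length-3 DFT, frequency 1
        total += abs(coeff) ** 2
    return total

seq = "ACT" * 100
print(coefficient_at_N_over_3(seq))     # matches the N/3 peak of the full spectrum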
The paper explores the connection between the size of the spectral coefficients of a nucleotide or any other symbolic sequence, and the distribution of nucleotides along certain subsequences. It explains the connection between the nucleotide distribution and the size of the spectral coefficients, and gives a necessary and sufficient condition for a coefficient to have a prescribed magnitude. Furthermore, it gives a fast algorithm for computing the value of a given spectral coefficient of a nucleotide sequence, discussing periods three and four as examples. Finally, it shows that the spectrum of a symbolic sequence is redundant, in the sense that there exists a linear recursion that determines the values of all the coefficients from those of a subset.
This paper was included in the Virtual Journal of Biological Physics Research, Volume 8, Issue 7, October 2004.