Bioinformatics
DNA and protein sequences may be represented as strings over a finite alphabet (in the case of DNA, the alphabet consists of A, C, G and T). Such sequences may also be regarded as symbolic signals, or symbolic vectors. Each signal sample (or each vector element) is a symbol, rather than a number. There is no order relation defined on the symbols, and no algebraic structure whatsoever. Such signals also arise in statistics (where they are sometimes called categorical data).
These papers are concerned with DNA sequence analysis, absent words or nullomers (strings that do not occur in the DNA), DNA data compression and the spectrum of symbolic signals. Regarding this last point, note that symbolic signals may exhibit periodicities, just like any other signal. For example, ACTACTACT... has period three. Is it possible to do Fourier analysis on symbolic signals? Fourier analysis is very useful in finite fields and in groups (commutative or noncommutative), but symbolic data lack such algebraic structure. Symbolic-to-numeric mapping is possible but yields results that depend on the mapping used. The methods discussed in a couple of these papers circumvent this difficulty.
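As a small illustration of the mapping problem (a minimal sketch using NumPy, not code from the papers), the snippet below maps the same periodic symbolic sequence to numbers in two arbitrary ways and shows that the resulting DFT magnitudes differ, whereas a spectrum built from the binary indicator sequences of the symbols needs no numeric assignment at all.

```python
# Minimal illustration of the mapping problem (assumption: NumPy only; this is
# not code from the papers).
import numpy as np

seq = "ACTACTACTACTACTACT"                 # a period-three symbolic signal (N = 18)
map1 = {"A": 0, "C": 1, "G": 2, "T": 3}    # one arbitrary numeric mapping
map2 = {"A": 1, "C": 2, "G": 3, "T": 0}    # another, equally arbitrary mapping

for mapping in (map1, map2):
    x = np.array([mapping[s] for s in seq], dtype=float)
    X = np.fft.fft(x)
    print(round(abs(X[0]), 2), round(abs(X[6]), 2))   # values depend on the mapping

# The summed power of the four binary indicator sequences needs no numeric
# mapping at all; its peak at bin N/3 = 6 reveals the period-three structure.
power = sum(np.abs(np.fft.fft([1.0 if s == a else 0.0 for s in seq])) ** 2
            for a in "ACGT")
print(round(power[6], 2))
```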
Describes a completely alignment-free computational method, based on a blind, unsupervised approach, for detecting large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method, complete human-chimpanzee and human-orangutan chromosomal information maps are given. The software used has been made publicly available and is described in detail.
Motivation: Ebola virus causes high-mortality hemorrhagic fevers, with more than 25 000 cases and 10 000 deaths in the current outbreak. Only experimental therapies are available; thus, novel diagnostic tools and druggable targets are needed. Results: Analysis of Ebola virus genomes from the current outbreak reveals the presence of short DNA sequences that appear nowhere in the human genome. We identify the shortest such sequences, with lengths between 12 and 14. Only three absent sequences of length 12 exist, and they consistently appear at the same location on two of the Ebola virus proteins, in all Ebola virus genomes, but nowhere in the human genome. The alignment-free method used is able to identify pathogen-specific signatures for quick and precise action against infectious agents, of which the current Ebola virus outbreak provides a compelling example. Availability and Implementation: EAGLE is freely available for non-commercial purposes at http://bioinformatics.ua.pt/software/eagle.
Similarity measures based on compression assess the distance between two objects from the number of bits needed to describe one, given a description of the other. Theoretically, compression-based similarity depends on the concept of Kolmogorov complexity, which is non-computable. Practical implementations require compression algorithms that are approximately normal. The approach has important advantages (no signal features to identify and extract, for example), but the compression method must be approximately normal. This paper proposes normal algorithms based on mixtures of finite-context models. Normality is attained by combining two new ideas: the use of least-recently-used caching in the context models, to allow deeper contexts, and data interleaving, to better exploit that cache. Examples for DNA sequences are given (at the human genome scale).
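As a rough illustration of the compression-based similarity idea (a sketch only, with the standard-library lzma compressor standing in for the mixture-of-finite-context-models compressor described in the paper), the normalized compression distance of two sequences can be approximated as follows.

```python
# Sketch of the normalized compression distance (NCD); assumption: lzma is a
# stand-in compressor, not the (approximately normal) DNA compressor of the paper.
import lzma
import random

def c(data: bytes) -> int:
    """Compressed size of `data`, in bytes."""
    return len(lzma.compress(data, preset=9))

def ncd(x: bytes, y: bytes) -> float:
    """Small for similar objects, close to 1 for unrelated ones."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

if __name__ == "__main__":
    a = b"ACGT" * 2500                                                   # repetitive sequence
    b = "".join(random.choice("ACGT") for _ in range(10_000)).encode()  # random sequence
    print(round(ncd(a, a), 3))   # near 0
    print(round(ncd(a, b), 3))   # near 1
```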
Data summarization and triage is one of the current top challenges in visual analytics. Ideally, one would like to let users visually inspect large data sets and examine or request data with particular characteristics. The need for summarization and visual analytics is also felt when dealing with digital representations of DNA sequences. Genomic data sets are growing rapidly, making their analysis increasingly more difficult and raising the need for new, scalable tools. For example, being able to look at very large DNA sequences while immediately identifying potentially interesting regions would provide the biologist with a flexible exploratory and analytical tool. In this paper we present a new concept, the "information profile", which provides a quantitative measure of the local complexity of a DNA sequence, independently of the direction of processing. Computing the information profiles is tractable: we show that it can be done in time proportional to the length of the sequence. We also describe a tool to compute the information profiles of a given DNA sequence, and use the genome of the fission yeast Schizosaccharomyces pombe strain 972 h- and five sequences of human chromosome 22 for illustration. We show that information profiles are useful for detecting large-scale genomic regularities by visual inspection. Several discovery strategies are possible, including the standalone analysis of single sequences, the comparative analysis of sequences from individuals of the same species, and the comparative analysis of sequences from different organisms. The comparison scale can be varied, allowing users to zoom in on specific details or obtain a broad overview of a long segment. Software applications have been made available for non-commercial use.
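A minimal sketch of the idea behind an information profile (assuming a single adaptive order-k finite-context model with a simple additive estimator, which is a simplification of the models actually used): the profile value at each position is the number of bits, -log2 P(symbol | context), that the model needs to encode that symbol, so the cost of the whole profile is proportional to the sequence length.

```python
# Sketch: per-position information content of a DNA string under a single
# order-k finite-context model (additive smoothing); a simplification of the
# profiles described in the paper.
from collections import defaultdict
from math import log2

def information_profile(seq: str, k: int = 4, alpha: float = 1.0) -> list[float]:
    counts = defaultdict(lambda: defaultdict(int))   # context -> symbol -> count
    profile = []
    for i, s in enumerate(seq):
        ctx = seq[max(0, i - k):i]
        n_s = counts[ctx][s]
        n_ctx = sum(counts[ctx].values())
        p = (n_s + alpha) / (n_ctx + 4 * alpha)      # additive estimator over {A,C,G,T}
        profile.append(-log2(p))                     # bits needed for this symbol
        counts[ctx][s] += 1                          # adaptive update
    return profile

if __name__ == "__main__":
    prof = information_profile("ACGTACGTACGTACGTTTTTTTTTTTT", k=3)
    print([round(b, 2) for b in prof])   # low values in predictable regions
```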
Genome assemblies are typically compared with respect to their contiguity, coverage and accuracy. This paper proposes a genome-wide, alignment-free distance based on compressed maximal exact matches, that is, perfect repeats, without gaps or misspellings, which cannot be further left- or right-extended without loss of similarity. The genomic distance proposed is based on the normalized compression distance, an information-theoretic measure of the relative compressibility of two sequences, as well as on the nesting structure underlying the assembly of larger maximal exact matches from smaller ones. Four human genome assemblies are used for illustration. Finally, we discuss the impact of genome sequencing and assembly on the final content of maximal exact matches and on the proposed genomic distance.
Previous studies have suggested that Chargaff's second rule may hold for relatively long words (above 10 nucleotides), but this has not been conclusively shown. In particular, the following questions remain open: Is the phenomenon of symmetry statistically significant? If so, what is the word length above which significance is lost? Can deviations in symmetry due to the finite size of the data be identified? This work addresses these questions by studying word symmetries in the human genome, chromosomes and transcriptome. To rule out finite-length effects, the results are compared with those obtained from random control sequences built to satisfy Chargaff's second parity rule. We use several techniques to evaluate the phenomenon of symmetry, including Pearson's correlation coefficient, total variational distance, a novel word symmetry distance, as well as traditional and equivalence statistical tests. We conclude that word symmetries are statistically significant in the human genome for word lengths up to 6 nucleotides. For longer words, we present evidence that the phenomenon may not be as prevalent as previously thought.
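As a small, hedged illustration of the kind of word symmetry being tested (not the paper's full statistical methodology), the following snippet counts every word of length k and its reverse complement and reports their Pearson correlation; under Chargaff's second parity rule the two counts should be nearly equal.

```python
# Sketch: correlation between k-mer counts and reverse-complement counts,
# one of several symmetry measures mentioned above (illustrative only).
from collections import Counter
from math import sqrt
import random

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(word: str) -> str:
    return word.translate(COMP)[::-1]

def symmetry_correlation(seq: str, k: int) -> float:
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = sorted(counts)
    x = [counts[w] for w in words]
    y = [counts[revcomp(w)] for w in words]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

if __name__ == "__main__":
    seq = "".join(random.choice("ACGT") for _ in range(100_000))
    # Near 0 for an i.i.d. random sequence; typically close to 1 for genomic DNA at small k.
    print(round(symmetry_correlation(seq, 4), 3))
```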
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. This paper is the first comprehensive study of the potential of Markov models to describe complete genomes. Our approach relies on (i) multiple competing Markov models of different orders; (ii) careful programming techniques that allow orders as large as sixteen; (iii) adequate handling of inverted repeats; and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. This measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information-theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models.
DNA may be represented by sequences of four symbols, but it is often useful to convert those symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but most of them seem to be unrelated to any intrinsic characteristic of DNA. The objective of this work was to study a mapping scheme that is directly related to DNA characteristics and that could be useful in discriminating between different species. Recently, we proposed a methodology based on the inter-nucleotide distance, which proved to contribute to the discrimination among species. In this paper, we introduce a new distance, the distance to the nearest dissimilar nucleotide, which is the distance from a nucleotide to the first occurrence of a different nucleotide. This distance is related to the repetition structure of single nucleotides. Using the information resulting from the concatenation of the distance to the nearest dissimilar nucleotide and the inter-nucleotide distance, we found that this new distance brings additional discriminative capabilities. This suggests that the distance to the nearest dissimilar nucleotide might contribute useful information about the evolution of the species.
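A minimal sketch of the two distances mentioned (my own illustrative reading, with forward-looking distances and 0 marking "no such occurrence"; conventions may differ from the paper): for each position, the inter-nucleotide distance is the distance to the next occurrence of the same nucleotide, and the distance to the nearest dissimilar nucleotide is the distance to the first following occurrence of a different nucleotide.

```python
# Sketch of the two distances (illustrative; 0 marks "no such occurrence").
def nearest_dissimilar_distances(seq: str) -> list[int]:
    """Distance from each position to the first following different nucleotide."""
    out = []
    for i, s in enumerate(seq):
        out.append(next((j - i for j in range(i + 1, len(seq)) if seq[j] != s), 0))
    return out

def internucleotide_distances(seq: str) -> list[int]:
    """Distance from each position to the next occurrence of the same nucleotide."""
    out = []
    for i, s in enumerate(seq):
        out.append(next((j - i for j in range(i + 1, len(seq)) if seq[j] == s), 0))
    return out

if __name__ == "__main__":
    s = "AAACCGTT"
    print(nearest_dissimilar_distances(s))   # [3, 2, 1, 2, 1, 1, 0, 0]
    print(internucleotide_distances(s))      # [1, 1, 0, 1, 0, 0, 1, 0]
```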
Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we explore different sets of minimal absent words in the genomes of 22 organisms (one archaeon, thirteen bacteria and eight eukaryotes). We investigate whether the mutational biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely in minimal absent words, and whether they extend to other organisms. We find that the compositional biases observed for the shortest absent words in vertebrates are not uniform throughout different sets of minimal absent words. We further investigate the hypothesis that minimal absent words are inherited through common ancestry, based on the similarity in dinucleotide relative abundances of different sets of minimal absent words, and find that this inheritance may be exclusive to vertebrates.
The three-base periodicity usually found in exons has been used for several purposes, for example the prediction of potential genes. In this paper, we use a data model, previously proposed for encoding the protein-coding regions of DNA sequences, to build signatures capable of supporting the construction of meaningful dendrograms. The model relies on the three-base periodicity and provides an estimate of the entropy associated with each of the three bases of the codons. We observe that the three entropy values vary among themselves and also from species to species. Moreover, we provide evidence that this makes it possible to associate a three-state entropy vector with each species, and we show that similar species are characterized by similar three-state entropy vectors.
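As a hedged illustration of the three-state entropy vector idea (a simplified estimate using zero-order distributions at each codon position, rather than the finite-context models of the paper), one can compute the Shannon entropy of the nucleotides occurring at positions 0, 1 and 2 of each codon.

```python
# Sketch: zero-order entropy (bits/base) at each of the three codon positions.
# Simplification: the paper's model uses finite-context models per position.
from collections import Counter
from math import log2

def codon_position_entropies(coding_seq: str) -> list[float]:
    entropies = []
    for phase in range(3):
        counts = Counter(coding_seq[phase::3])
        total = sum(counts.values())
        h = -sum((n / total) * log2(n / total) for n in counts.values())
        entropies.append(h)
    return entropies

if __name__ == "__main__":
    # Toy coding-like sequence; real use would take concatenated exons in frame.
    seq = "ATGGCTGCAGCGGCAGCTTAA" * 100
    print([round(h, 3) for h in codon_position_entropies(seq)])
```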
Presents a method for processing DNA sequences, based on inter-nucleotide distances, that yields genomic signatures for complete genomes able to discriminate between different species. Using these signatures and hierarchical clustering, it is possible to build phylogenetic trees, which expose genome differentiation and allow the inference of phylogenetic relations. The phylogenetic trees obtained suggest that the inter-nucleotide distances capture the essential information about the genomes. To create the genomic signature, we construct a vector that describes the inter-nucleotide distance distribution of a complete genome and compare it with a reference distance distribution, namely the distribution of a sequence in which the nucleotides are placed randomly and independently. It is the residual, or relative error, between the data and the reference distribution that is used to compare the DNA sequences of different organisms.
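A rough sketch of the signature construction under my own simplified reading (for independent, random placement of a nucleotide with relative frequency p, the reference distance distribution is geometric, P(d) = p(1-p)^(d-1), and the signature holds the relative error between the observed distribution and this reference); the actual construction in the paper may differ in its details.

```python
# Sketch: inter-nucleotide distance distribution of one nucleotide compared
# with the geometric reference of a random sequence (simplified signature).
from collections import Counter
import random

def distance_distribution(seq: str, nucleotide: str, max_d: int = 20) -> list[float]:
    positions = [i for i, s in enumerate(seq) if s == nucleotide]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    counts = Counter(gaps)
    return [counts[d] / len(gaps) for d in range(1, max_d + 1)]

def reference_distribution(p: float, max_d: int = 20) -> list[float]:
    # Geometric law for independent, random placement with frequency p.
    return [p * (1 - p) ** (d - 1) for d in range(1, max_d + 1)]

def signature(seq: str, nucleotide: str, max_d: int = 20) -> list[float]:
    obs = distance_distribution(seq, nucleotide, max_d)
    ref = reference_distribution(seq.count(nucleotide) / len(seq), max_d)
    return [(o - r) / r for o, r in zip(obs, ref)]   # relative error per distance

if __name__ == "__main__":
    seq = "".join(random.choice("ACGT") for _ in range(100_000))
    print([round(v, 3) for v in signature(seq, "A", max_d=5)])  # near 0 for random data
```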
Highly accessed according to the publisher. Open access article.
Shows how absent words relate to the repetitions and structure of the DNA data, and defines the class of minimal absent words. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, the resulting word is no longer an absent word. The paper gives an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. The algorithm is publicly available. The class of minimal absent words is larger than that of the shortest absent words, yet still manageable, since it grows at most linearly with the string size (unlike the class of generic absent words, which grows exponentially).
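For concreteness, a brute-force sketch of the definition (not the paper's near-linear-time algorithm): a word w is a minimal absent word of s if w does not occur in s but both the word obtained by dropping its leftmost character and the one obtained by dropping its rightmost character do occur.

```python
# Brute-force sketch of minimal absent words (exponential in the word length;
# the paper's algorithm is far more efficient).
from itertools import product

def substrings(s: str, k: int) -> set[str]:
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minimal_absent_words(s: str, max_len: int, alphabet: str = "ACGT") -> list[str]:
    maws = []
    for k in range(2, max_len + 1):
        present_k = substrings(s, k)
        present_km1 = substrings(s, k - 1)
        for word in ("".join(p) for p in product(alphabet, repeat=k)):
            if (word not in present_k
                    and word[1:] in present_km1
                    and word[:-1] in present_km1):
                maws.append(word)
    return maws

if __name__ == "__main__":
    print(minimal_absent_words("ACGTACGTT", max_len=3))
```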
Several specific DNA lossless compression methods have been proposed. Most of them exploit the presence of exact or approximate repeats and use low-order finite-context models as a secondary, or fallback, mechanism. This paper shows that finite-context models can be used as the primary DNA encoding method. The coding method is based on two finite-context models that compete for the encoding of the data, on a block-by-block basis.
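A hedged sketch of the competition idea (greatly simplified, with illustrative block size and model orders; the actual codec uses arithmetic coding and carefully tuned estimators): for each block, estimate the number of bits each finite-context model would spend and keep the cheaper one, plus a small cost to signal the choice.

```python
# Sketch: two adaptive finite-context models of different orders competing
# block by block; the per-block cost is the ideal code length in bits.
# Block size and orders are illustrative, not the paper's parameters.
from collections import defaultdict
from math import log2

class FCM:
    def __init__(self, k: int, alpha: float = 1.0):
        self.k, self.alpha = k, alpha
        self.counts = defaultdict(lambda: defaultdict(int))

    def block_cost(self, seq: str, start: int, end: int) -> float:
        bits = 0.0
        for i in range(start, end):
            ctx = seq[max(0, i - self.k):i]
            c = self.counts[ctx]
            p = (c[seq[i]] + self.alpha) / (sum(c.values()) + 4 * self.alpha)
            bits += -log2(p)
        return bits

    def update(self, seq: str, start: int, end: int) -> None:
        for i in range(start, end):
            self.counts[seq[max(0, i - self.k):i]][seq[i]] += 1

def encode_cost(seq: str, block: int = 100) -> float:
    low, high = FCM(k=2), FCM(k=8)
    total = 0.0
    for start in range(0, len(seq), block):
        end = min(start + block, len(seq))
        costs = (low.block_cost(seq, start, end), high.block_cost(seq, start, end))
        total += min(costs) + 1          # +1 bit to signal which model was used
        low.update(seq, start, end)      # both models see all the data
        high.update(seq, start, end)
    return total

if __name__ == "__main__":
    s = "ACGT" * 1000
    print(round(encode_cost(s) / len(s), 3), "bits per base")
```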
The protein-coding regions of DNA usually exhibit a three-base periodicity, and this paper explores that fact. It introduces a DNA model based on three deterministic states, where each state implements a finite-context model. The experimental results obtained confirm the appropriateness of the proposed approach, showing compression gains in relation to the single finite-context-model counterpart. Additionally, and potentially more interesting than the compression gain itself, is the observation that the entropy associated with each of the three base positions of a codon differs, and that this variation is not the same among the organisms analyzed.
This paper explores the three-base periodicity usually found in the protein-coding regions of DNA. It describes a DNA model based on three deterministic states, where each state implements a finite-context model, and discusses the experimental results obtained, which show compression gains in relation to the single finite-context model counterpart. A more complete account can be found in A Three-State Model for DNA Protein-Coding Regions.
The paper discusses methods for the Fourier analysis of symbolic data, such as DNA sequences. The indicator sequence approach is shown to be equivalent to the method based on the autocorrelation function, and also to the spectral envelope method.
One of the purposes of the paper is to present an approach to the analysis of symbolic data based on the autocorrelation concept: the autocorrelation of a symbolic sequence can be defined in a very natural way, and it leads to a numerical sequence. We then show that the Fourier transform of this sequence can be expressed in terms of the binary indicator sequences of the alphabet symbols. This approach sheds some light on the interplay between several of the methods that have been proposed for the Fourier analysis of symbolic DNA data.
Efficient computation procedures are also discussed, and it is shown that spectral analysis can be more efficient when obtained directly from the Fourier transforms of the indicator sequences.
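To make the equivalence concrete, here is a small sketch under my own conventions (circular shifts; the papers treat the exact relations): the symbolic autocorrelation counts the positions at which the sequence matches a shifted copy of itself, and its DFT equals the sum of the squared DFT magnitudes of the four indicator sequences.

```python
# Sketch: the symbolic (circular) autocorrelation has a DFT equal to the sum of
# the squared DFT magnitudes of the four indicator sequences (illustrative
# conventions; not the papers' exact formulas).
import numpy as np

def symbolic_autocorrelation(seq: str) -> np.ndarray:
    n = len(seq)
    return np.array([sum(seq[i] == seq[(i + m) % n] for i in range(n))
                     for m in range(n)], dtype=float)

def indicator_power(seq: str, alphabet: str = "ACGT") -> np.ndarray:
    return sum(np.abs(np.fft.fft([1.0 if s == a else 0.0 for s in seq])) ** 2
               for a in alphabet)

if __name__ == "__main__":
    seq = "ACTACTACTGATTACA"
    lhs = np.real(np.fft.fft(symbolic_autocorrelation(seq)))
    rhs = indicator_power(seq)
    print(np.allclose(lhs, rhs))   # True: the two formulations agree
```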
Among other things, this paper describes a fast algorithm for computing the Fourier coefficient of a symbolic sequence at N/3 (corresponding to period three). In DNA, the magnitude of this coefficient is of interest in gene detection. The method proposed in the paper requires only trivial discrete Fourier transforms (DFTs) of length three.
The first step of the method is to obtain the symbol counts for each nucleotide, that is, the number of occurrences of each nucleotide (A, C, G or T) at positions 3n, 3n+1, and 3n+2. A simple computation (equivalent to a DFT of length three) acts on these symbol counts and yields the magnitude of the Fourier coefficient at N/3. These symbol counts are related to the so-called block-sum transformation, in this case of length three (see also A note on the block sum transformation in the section DFT).
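A sketch of this computation in my own simplified restatement (assuming the length N is a multiple of three and working with the indicator sequences): count how many times each symbol occurs at positions congruent to 0, 1 and 2 modulo 3, and combine the counts with a length-three DFT; the result matches a full-length FFT evaluated at bin N/3.

```python
# Sketch (assumes len(seq) is a multiple of 3): the indicator-sequence
# coefficients at k = N/3 follow from the 4 x 3 table of symbol counts per
# position modulo 3, via a DFT of length three. Illustrative restatement only.
import numpy as np

def power_at_period_three(seq: str, alphabet: str = "ACGT") -> float:
    n = len(seq)
    assert n % 3 == 0
    w = np.exp(-2j * np.pi * np.arange(3) / 3)     # length-3 DFT kernel at bin 1
    total = 0.0
    for a in alphabet:
        counts = [sum(1 for i in range(r, n, 3) if seq[i] == a) for r in range(3)]
        total += abs(np.dot(counts, w)) ** 2       # |indicator DFT at N/3|^2
    return total

if __name__ == "__main__":
    seq = "ACTACTACTACTACTACT"
    n = len(seq)
    # Direct check against a full-length FFT of the indicator sequences.
    direct = sum(abs(np.fft.fft([1.0 if s == a else 0.0 for s in seq])[n // 3]) ** 2
                 for a in "ACGT")
    print(round(power_at_period_three(seq), 3), round(direct, 3))   # identical values
```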
The paper explores the connection between the size of the spectral coefficients of a nucleotide or any other symbolic sequence, and the distribution of nucleotides along certain subsequences. It explains the connection between the nucleotide distribution and the size of the spectral coefficients, and gives a necessary and sufficient condition for a coefficient to have a prescribed magnitude. Furthermore, it gives a fast algorithm for computing the value of a given spectral coefficient of a nucleotide sequence, discussing periods three and four as examples. Finally, it shows that the spectrum of a symbolic sequence is redundant, in the sense that there exists a linear recursion that determines the values of all the coefficients from those of a subset.
This paper was included in the Virtual Journal of Biological Physics Research, Volume 8, Issue 7, October 2004.