Universidade de Aveiro
The home page of Paulo J. S. G. Ferreira

DNA sequences

DNA and protein sequences may be represented as strings over a finite alphabet (in the case of DNA, the alphabet consists of A, C, G and T). Such sequences may also be regarded as symbolic signals, or symbolic vectors. Each signal sample (or each vector element) is a symbol, rather than a number. There is no order relation defined on the symbols, and no algebraic structure whatsoever. Such signals also arise in statistics (where they are sometimes called categorical data).

These papers are concerned with DNA sequence analysis, absent words or nullomers (strings that do not occur in the DNA), DNA data compression and the spectrum of symbolic signals. Regarding this last point, note that symbolic signals may exhibit periodicities, just like any other signal. For example, ACTACTACT... has period three. Is it possible to do Fourier analysis on symbolic signals? Fourier analysis is very useful in finite fields and in groups (commutative or noncommutative), but symbolic data lack such algebraic structure. Symbolic-to-numeric mapping is possible but yields results that depend on the mapping used. The methods discussed in a couple of these papers circumvent this difficulty.

Vera Afreixo, Carlos A. C. Bastos, Armando J. Pinho, Sara P. Garcia and Paulo J. S. G. Ferreira. Genome analysis with distance to the nearest dissimilar nucleotide. Journal of Theoretical Biology, vol. 275, no. 1, pp. 52-58, Apr. 2011.

DNA may be represented by sequences of four symbols, but it is often useful to convert those symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but most of them seem to be unrelated to any intrinsic characteristic of DNA. The objective of this work was to study a mapping scheme that is directly related to DNA characteristics, and that could be useful in discriminating between different species. Recently, we have proposed a methodology based on the inter-nucleotide distance, which proved to contribute to the discrimination among species. In this paper, we introduce a new distance, the distance to the nearest dissimilar nucleotide, which is the distance of a nucleotide to first occurrence of a different nucleotide. This distance is related to the repetition structure of single nucleotides. Using the information resulting from the concatenation of the distance to the nearest dissimilar and the inter-nucleotide distance, we found that this new distance brings additional discriminative capabilities. This suggests that the distance to the nearest dissimilar nucleotide might contribute with useful information about the evolution of the species.

Sara P. Garcia, Armando J. Pinho, João M. O. S. Rodrigues, Carlos A. C. Bastos and Paulo J. S. G. Ferreira. Minimal Absent Words in Prokaryotic and Eukaryotic Genomes. PLoS ONE, vol. 6, no. 1, pp. e16065, Jan. 2011.

Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we explore different sets of minimal absent words in the genomes of 22 organisms (one archaeota, thirteen bacteria and eight eukaryotes). We investigate if the mutational biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely in minimal absent words, as well as to other organisms. We find that the compositional biases observed for the shortest absent words in vertebrates are not uniform throughout different sets of minimal absent words. We further investigate the hypothesis of the inheritance of minimal absent words through common ancestry from the similarity in dinucleotide relative abundances of different sets of minimal absent words, and find that this inheritance may be exclusive to vertebrates.

Armando J. Pinho, Sara P. Garcia, Paulo J. S. G. Ferreira, Vera Afreixo, Carlos A. C. Bastos, António J. R. Neves and João M. O. S. Rodrigues. Exploring Homology Using the Concept of Three-State Entropy Vector. In: Pattern Recognition in Bioinformatics, Tjeerd Dijkstra, Evgeni Tsivtsivadze, Elena Marchiori and Tom Heskes (Eds.). Lecture Notes in Computer Science, vol. 6282, Springer, Berlin, pp. 161-170, 2010.

The three-base periodicity usually found in exons has been used for several purposes, as for example the prediction of potential genes. In this paper, we use a data model, previously proposed for encoding protein-coding regions of DNA sequences, to build signatures capable of supporting the construction of meaningful dendograms. The model relies on the three-base periodicity and provides an estimate of the entropy associated with each of the three bases of the codons. We observe that the three entropy values vary among themselves and also from species to species. Moreover, we provide evidence that this makes it possible to associate a three-state entropy vector with each species, and we show that similar species are characterized by similar three-state entropy vectors.

Vera Afreixo, Carlos A. C. Bastos, Armando J. Pinho, Sara P. Garcia and Paulo J. S. G. Ferreira. Genome analysis with inter-nucleotide distances. Bioinformatics, vol. 25, no. 23, pp. 3064-3070, 2009.

Presents a method to process DNA sequences based on inter-nucleotide distances and which yields genomic signatures for complete genomes able to discriminate between different species. Using these signatures and hierarchical clustering it is possible to build phylogenetic trees. Phylogenetic trees lead to genome differentiation and allow the inference of phylogenetic relations. The phylogenetic trees obtained suggest that the inter-nucleotide distances capture the essential information about the genomes. To create the genomic signature, we construct a vector which describes the inter-nucleotide distance distribution of a complete genome and compare it with the reference distance distribution, which is the distribution of a sequence where the nucleotides are placed randomly and independently. It is the residual or relative error between the data and the reference distribution that is used to compare the DNA sequences of different organisms.

Armando J. Pinho, Paulo J. S. G. Ferreira, Sara P. Garcia and João M. O. S. Rodrigues. On Finding Minimal Absent Words. BMC Bioinformatics, vol. 10, no. 1, pp. 137, 2009.

highly accessed Highly accessed according to the publisher. Open access article.

Shows how absent words relate to the repetitions and structure of the DNA data and defines the class of minimal absent words. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. The paper gives an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. The algorithm is publically available. The class of minimal absent words is larger than that of shortest absent words yet still manageable, since it grows at most linearly with the string size (unlike the class of generic absent words, which grows exponentially).

Armando J. Pinho, António J. R. Neves, Carlos A. C. Bastos and Paulo J. S. G. Ferreira. DNA Coding Using Finite-Context Models and Arithmetic Coding. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, Taipei, Taiwan, pp. 1693-1696, Apr. 2009.

Several specific DNA lossless compression methods have been proposed. Most of them explore the presence of exact or approximate repeats and use low order finite-context models as a secondary or fall back mechanism. This paper shows that finite-context models can be used as primary DNA encoding methods. The coding method is based on two finite-context models that compete for the encoding of data, on a block by block basis.

A. J. Pinho, A. J. R. Neves, V. Afreixo, C. A. C. Bastos and P. J. S. G. Ferreira. A Three-State Model for DNA Protein-Coding Regions. IEEE Transactions on Biomedical Engineering, vol. 53, no. 11, pp. 2148-2155, Nov. 2006.

The protein-coding regions of DNA usually exhibit a three-base periodicity. This paper explores this fact. It introduces a DNA model based on three deterministic states, where each state implements a finite-context model. The experimental results obtained confirm the appropriateness of the proposed approach, showing compression gains in relation to the single finite-context model counterpart. Additionally, and potentially more interesting than the compression gain on its own, is the observation that the entropy associated to each of the three base positions of a codon differs and that this variation is not the same among the organisms analyzed.

Paulo J. S. G. Ferreira, António J. R. Neves, Vera Afreixo and Armando J. Pinho. Exploring three-base periodicity for DNA compression and modeling. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2006, vol. V, Toulouse, France, pp. 877-880, May 2006.

This paper explores the three-base periodicity usually found in the protein-coding regions of DNA. It describes a DNA model based on three deterministic states, where each state implements a finite-context model, and discusses the experimental results obtained, which show compression gains in relation to the single finite-context model counterpart. A more complete account can be found in A Three-State Model for DNA Protein-Coding Regions.

Vera Afreixo, P. J. S. G. Ferreira and Dorabella Santos. Fourier Analysis of Symbolic Data: A Brief review. Digital Signal Processing, vol. 14, no. 6, pp. 523-530, Nov. 2004.

The paper discusses methods for the Fourier analysis of symbolic data, such as DNA sequences. The indicator sequence approach is shown to be equivalent to the method based on the autocorrelation function, and also to the spectral envelope method.

One of the purposes of the paper is to present an approach to the analysis of symbolic data based on the autocorrelation concept. It can be defined in a very natural way and it leads to a numerical sequence. Then, we show that the Fourier transform of this sequence can be expressed in terms of the binary indicator sequences for each of the alphabet symbols. This approach sheds some light on the interplay between several of the methods that have been proposed for the Fourier analysis of symbolic DNA data.

Efficient computation procedures are also discussed, and it is shown that spectral analysis can be more efficient when obtained directly from the Fourier transforms of the indicator sequences.

Vera Afreixo, Paulo J. S. G. Ferreira and Dorabella Santos. Spectrum and symbol distribution of nucleotide sequences. Physical Review E, vol. 70, no. 3, pp. 031910, Sep. 2004.

The most interesting result of the paper is perhaps the fast algorithm for computing the Fourier coefficient at N/3 (corresponding to period three). The magnitude of this coefficient is important for gene detection. The method proposed in the paper requires only trivial discrete Fourier transforms (DFT) of length three.

The first step of the method is to obtain the symbol counts for each nucleotide, that is, the number of nucleotides (A, C, G or T) in positions 3n, 3n+1, and 3n+2. A simple computation (equivalent to a DFT of length three) acts on these symbol counts and yields the magnitude of the Fourier coefficient at N/3. These symbol counts are related to the so-called block-sum transformation, in this case of length three (see also A note on the block sum transformation in the section DFT).

The paper explores the connection between the size of the spectral coefficients of a nucleotide or any other symbolic sequence, and the distribution of nucleotides along certain subsequences. It explains the connection between the nucleotide distribution and the size of the spectral coefficients, and gives a necessary and sufficient condition for a coefficient to have a prescribed magnitude. Furthermore, it gives a fast algorithm for computing the value of a given spectral coefficient of a nucleotide sequence, discussing periods three and four as examples. Finally, it shows that the spectrum of a symbolic sequence is redundant, in the sense that there exists a linear recursion that determines the values of all the coefficients from those of a subset.

This paper was included in the Virtual Journal of Biological Physics Research, Volume 8, Issue 7, October 2004.