Published: 21st August 2017 DOI: 10.4204/EPTCS.252 ISSN: 2075-2180
The AFL conference series was initiated by Prof. István Peák (1936-1989). He organized the AFL conferences in 1980, 1982, 1984, 1986 and 1988 and started the organization of AFL'90. These conferences were all held in the hills around Salgótarján. In 1986 and 1988, the title of the conference was Automata, Languages and Programming Systems. Since the untimely death of Prof. István Peák in 1990, the AFL conferences have been organized every third year. Professors András Ádám and Pál Dömösi took the lead in continuing the series. In 1993 and 1996, two more "Salgótarján conferences" took place; the last three conferences of the series were held in Balatonfüred (AFL 2008), Debrecen (AFL'11), and Szeged (AFL 2014).
AFL 2017 was intended to cover all important areas of automata and formal language theory and their applications, such as grammars, acceptors and transducers for strings, trees, graphs, arrays, algebraic theories for automata and languages, combinatorial properties of words and languages, formal power series, decision problems, efficient algorithms for automata and languages, relations to complexity theory and logic, picture description and analysis, quantum computing, cryptography, concurrency, applications of automata and language theory in biology, natural language processing, and other fields.
The Programme Committee invited lectures from
This volume contains the abstracts of invited lectures and the texts of the 17 papers accepted from the 23 submissions to AFL 2017.
The members of the International Program Committee were
Abstract The advent of Next Generation Sequencing (NGS) technologies is characterized by the production of massive amounts of data consisting of short sequences (reads), and it has enabled the completion of several sequencing projects on entire populations.
This talk focuses on the application of concepts from automata theory and formal languages to Bioinformatics, motivated by the repetitive nature that is intrinsic to the sequences that are produced with NGS technologies.
The analysis of such data raises the need for formal tools that are able to take advantage of the similarities among sequences to index and compare NGS data in a memory- (and hence time-) efficient manner. We will explore some recent applications of the Burrows-Wheeler Transform, the Lyndon factorization and automata theory to the analysis of genomic data.
A vast collection of computational problems in Bioinformatics is strongly dependent on the type of available biological data. Thus the formalization, the computational complexity, and the algorithmic solutions of many of these problems are affected by the intrinsic characteristics of the input data, both in terms of quality and quantity. A clear example is represented by the computational projects on sequencing data, where the algorithmic approaches and the data structures employed to attack the de novo assembly problem have changed, following the advancement of sequencing technologies. Today sequencing technologies have brought a revolution in the common practice of researchers working in molecular biology. Sequencing any organism is feasible, cheap, and can be completed quickly: thus biologists may use hundreds of millions of short sequences (reads) for their genomic studies. This requires new computational tools that are able to deal with this huge amount of sequence data, in turn encouraging the development of novel formal approaches to the comparison, analysis and management of sequencing data, known as NGS (next generation sequencing) data. Computer Science research on NGS has initially focused on the problem of representing such data with different formal approaches, where the representation in many cases is the first step toward the analysis and comparison of the data. Here we try to survey and summarize some theoretical frameworks that have been developed in the field, and how they have changed our perspectives on data structures and algorithms for molecular biology. One main characteristic of NGS data has guided these frameworks: the repetitive nature of reads, due to the fact that they are extracted from the same sequence (the reference genome); most importantly, some input sequences can be viewed as shifted versions of others.
We will discuss the following main notions that have recently gained strong attention in Bioinformatics, but have their foundations in stringology and formal language theory.
The investigation of the repetitive nature of a collection of strings has led to the use in Bioinformatics of two main notions developed in the context of compression and text indexing: the Burrows-Wheeler Transform [5] and the FM-index for a collection of strings. The BWT has extraordinarily improved the efficiency of performing fundamental operations on sequences in Bioinformatics, such as sequence alignment, thanks to its capability to provide a succinct representation of biological data.
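The BWT can be illustrated with a naive sketch that sorts all rotations of the input (terminated by a sentinel) and reads off the last column; production tools build it from a suffix array instead of materializing the rotations. The function name and the `$` sentinel are illustrative choices, not taken from the cited works.

```python
def bwt(s, sentinel="$"):
    """Naive Burrows-Wheeler Transform: sort all rotations of s + sentinel
    and return the last column of the sorted rotation matrix.
    O(n^2 log n) as written; suffix-array constructions achieve O(n)."""
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("banana"))  # annb$aa
```

The transform groups together occurrences of the same character that appear in similar contexts, which is what makes the output highly compressible for repetitive collections such as reads.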
We will discuss how to compute the overlap graph, which is a simple and compact representation of all prefix-suffix matches of a collection of strings. In [6], an $O(nm)$-time algorithm (where $n$ is the number of sequences, each of length $m$), i.e., linear in the total input size, was given to compute an overlap graph. The first lightweight approach for building an overlap graph was given in [4, 3], leading to a fast approach that works directly on the FM-index of the strings [2].
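The definition of the overlap graph can be made concrete with a naive all-pairs sketch: an edge records that a suffix of one read matches a prefix of another. This quadratic version is purely illustrative; the FM-index based methods cited above are what make the computation feasible at NGS scale. The function name and the `min_overlap` threshold are illustrative choices.

```python
def overlap_graph(reads, min_overlap=3):
    """Naive overlap graph: edge (i, j, k) means a length-k suffix of
    reads[i] equals a prefix of reads[j], with k >= min_overlap.
    Keeps only the longest overlap per ordered pair."""
    edges = []
    for i, u in enumerate(reads):
        for j, v in enumerate(reads):
            if i == j:
                continue
            for k in range(min(len(u), len(v)), min_overlap - 1, -1):
                if u.endswith(v[:k]):
                    edges.append((i, j, k))
                    break
    return edges

print(overlap_graph(["GATTACA", "TACAGGT", "AGGTCC"]))
# [(0, 1, 4), (1, 2, 4)]
```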
One of the challenging tasks in the theory of formal languages is the following one: given a language $L$ and a family of languages $\cal{L}$, prove that $L$ is not contained in $\cal{L}$. For particular cases, a number of necessary conditions for a language to be contained in a family are known, expressed, for example, by pumping lemmas or by the interchange lemma for context-free languages [3]. Unfortunately, there is a considerable number of languages that meet some or all of the known necessary conditions to be contained in a family $\cal{L}$, but are not contained in $\cal{L}$, or for which it is not known whether or not they belong to $\cal{L}$. A prominent example is the open problem of whether the language $Q$ of all primitive words is context-free [2].
In the current work, a different approach is developed. Many important families of languages are characterized by some class of automata. We aim to disprove that a given language $L$ is contained in a family $\cal{L}$ by showing that $L$ cannot be accepted by an automaton from the class characterizing $\cal{L}$. The main issue in such proofs will be to verify two facts:
Here, we make use of the observation that one can think of computations as being guided by, or containing or using, information. A state $p$, for instance, contains the information about which computations can originate at $p$ given suitable inputs. In order to keep the concepts as independent of particular types of automata as possible, we introduce them for a model that has an unstructured storage type. Moreover, it has the following properties: it is one-way, that is, it cannot re-read an input symbol; it is non-deterministic; it has a non-deterministic start state; it is real-time, that is, it reads exactly one input symbol in every step. It is easily shown that every language not containing the empty word is accepted by an automaton of this kind, even by a real-time automaton with a single start state and a single accepting state, regardless of computability issues. Clearly, the usual subset construction allows one to turn any automaton of this kind into a deterministic one. In several cases, one needs to represent the states in a specific way so as to model the access to information contained in a state. For this purpose, a state may, for instance, be a word over a state representation alphabet. One class of automata using such a representation of states is that of single-state pushdown automata accepting all context-free languages.
Notions like those of equivalence, unavoidability and informational containment for states or sets of states are introduced. We show general properties of the computations of automata on the basis of these notions. Then the following concepts are derived: information contents of states, establishing information, preserving information, and using information contained in states.
Finally, the case of single-state pushdown automata (in some normal form) is treated in more detail. (A preliminary version of these considerations was published in [1]; it has been strongly revised here.) The concepts developed are used in order to provide a new necessary condition for languages to be context-free. It is demonstrated how to apply this statement in order to prove that particular languages are not context-free. Advantages of the new proof technique over the pumping lemmas are discussed.
Regular languages are perhaps the most thoroughly studied class of the Chomsky hierarchy. Because of the wealth of existing results and several alternative characterizations in terms of grammars, finite automata and logic, most decidability questions have been answered or are easy to deduce from previous results. Some very challenging problems for regular languages, especially concerning decidability, arose from applying word operations to all words in a regular language and asking under what conditions the result is still regular. We give an overview of the characterizations of images of regular languages under some, mostly biologically inspired such operations. Where appropriate, we discuss the related decidability problem, i.e., for an operation op is it decidable whether the image op(L) of a given regular language L is regular?
We start the survey with duplication, an operation inspired by the so-called tandem repeats in genetic sequences, which takes a factor of a word and inserts a copy of it right next to the original occurrence. This operation has been extensively studied in theoretical computer science and bioinformatics. As a nice example of applying Ehrenfeucht's generalized version of the Myhill-Nerode theorem, Bovet and Varricchio [2] proved that the closure under duplication (in their terminology, copy languages) of arbitrary recursively enumerable languages over a binary alphabet is regular, by showing that iterated duplication induces a well-quasi order on binary words.
Next, we look at a similar operation, albeit one which had a different inspiration. Calbrix and Nivat defined the power of a language L in a paper about prefix and period languages of rational $\omega$-languages [4]: $pow(L) = \{w^i \mid w \in L, i \geq 1\}$. The image of a regular language is not always regular; for instance, the power of the regular language $ab^*$ is not even context-free. Calbrix and Nivat posed the problem of characterizing those regular languages whose powers are also regular, and of deciding whether a given regular language has this property. After partial solutions given by Cachat [3] and Horváth et al. [9], the final, positive, answer was given through a characterization and decision algorithm [6] which exemplifies a theme common to the solution of another problem on our list, the regularity of pseudopalindromic closures.
Several operations on sequences were introduced which are either directly motivated by the biological phenomenon called stem-loop completion, or are very similar in nature to it. The mathematical hairpin concept refers to a word in which some suffix (or prefix) is the mirrored complement of a middle factor of the word, i.e., $uxv\bar{x}$, where $\bar{x}$ is the reverse of the letter-by-letter complement of $x$. The hairpin completion operation, which extends such a word into one with matching ends, was studied both as a one-step operation and as an iterated one. The aforementioned questions regarding regularity of the image have been settled for one-step completion [5]. In the case of iterated completion, the state of the art is a recent promising result on the completion of so-called (2,2)-crossing words [11], which may allow a generalization to more complicated crossing words and thus to the solution of the problem for singleton starting languages.
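The hairpin structure $uxv\bar{x}$ described above can be checked mechanically over the DNA alphabet. The following is a minimal sketch under the stated definition ($\bar{x}$ is the reverse of the letter-by-letter complement of $x$); the function names, the complement table, and the minimum stem length parameter `k` are illustrative choices, not notation from the cited papers.

```python
# Watson-Crick complement over the DNA alphabet (illustrative choice).
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(x):
    """Mirrored complement: reverse of the letter-by-letter complement."""
    return "".join(COMP[c] for c in reversed(x))

def is_hairpin(w, k):
    """Does w factor as u x v theta(x) with |x| >= k (u, v possibly empty)?
    theta is an involution, so x = theta(suffix) for a candidate suffix."""
    n = len(w)
    for m in range(k, n // 2 + 1):      # stem length; x and theta(x) disjoint
        x = theta(w[n - m:])
        if x in w[: n - m]:
            return True
    return False

print(is_hairpin("ACGT", 2))  # True: u = v = empty, x = AC, theta(x) = GT
print(is_hairpin("AAAA", 2))  # False: theta(AA) = TT never occurs
```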
Motivated by the open problem of iterated hairpin completion, a restricted version of it, pseudopalindromic completion was introduced in [7]. Pseudopalindromes are words that are fixed points for an antimorphic involution. In the case of pseudopalindromic completion, symbols are added to either side of the word such that the newly obtained words are pseudopalindromes. This restriction proved strong enough to allow a characterization of regular languages that are closed under this operation and to show that the regularity of the closure under the operation is decidable.
Finally, we discuss results about the regularity aspect in the case of recently introduced related operations: hairpin lengthening, which is another restricted version of hairpin completion, and its inverse, hairpin shortening [10]; prefix-suffix duplication, which is a duplication restricted to prefixes and suffixes, and its inverse, prefix-suffix reduction, introduced in [8] and [1], respectively.
Networks of Bio-inspired Processors (NBP) [3] are a family of models of computation inspired by different biological processes over biomacromolecules (mainly, DNA and RNA molecules). These operations can be applied over strings to define formal languages. Any bio-inspired processor contains a multiset of strings, where every string can be found in an arbitrarily large number of copies. The processors apply the operations defined over their strings in a massively parallel and non-deterministic manner, and they obtain a (possibly) new multiset of strings. A network model arranges the connection of a finite number of bio-inspired processors. The processors can be equipped with input/output filters to receive/send strings from/to the other processors connected to them. The network computes by alternating two kinds of computation steps: operation application and string communication. These models have been formulated as accepting and generating devices.
The first model of NBPs was formulated by V. Mitrana (the founder of this research topic) as Networks of Evolutionary Processors (NEP) [1]. There, the operations in the processors were inspired by point mutation over the nucleotides in DNA. Later, the splicing biological process during the DNA transcription and RNA translation inspired the Networks of Splicing Processors (NSP) [2].
The Networks of Genetic Processors (NGP) were inspired by NEPs and NSPs, using some operations previously defined for them. At every processor, only symbol substitutions or full string crossover are applied. Hence, the model can be easily related to classical Genetic Algorithms. It has been proved that NGPs are computationally complete, given that they can simulate any Turing machine, if an accepting version of the model is defined [4]. The generating version of this model can be used to introduce a descriptive complexity measure of formal languages defined according to the classical Chomsky hierarchy: three, four, six or eight processors can be used to characterize the regular, context-free, context-sensitive and recursively enumerable languages, respectively [6].
In addition, the NGPs can be used to formally prove the computational completeness of Genetic Algorithms, if some restrictions are imposed [4], and they have been fruitfully used to solve hard optimization combinatorial problems efficiently [5].
The problem of sorting lists by prefix reversals, also known as the pancake sorting problem, is to sort randomly piled pancakes of different sizes. A prefix reversal is an operation of reversing elements in the list including the beginning element. The minimum number of prefix reversals needed to sort a given list of integers into ascending order was asked by Gates and Papadimitriou in [4], and an upper bound of $\frac{5}{3}n$ for sorting a list of length $n$ was given there. They conjectured that $\frac{19}{16}n$ is required; however, this was disproved in [5]. A better bound of $\frac{18}{11}n$ was given in [2]. On the other hand, the pancake sorting problem was shown to be NP-hard in [1].
The pancake sorting problem can be extended to a rearrangement problem of two dimensional arrays by prefix reversals as follows. Suppose $A$ is an $n\times m$ array. Then $A$ consists of $n \times m$ cells. We denote the entry in the $(i,j)$ position of $A$ by $a_{ij}$ (see Figure 1). The standard array $E_{n\times m}$ of size $n\times m$ is defined to be $(e_{ij})$ where $e_{ij}$ is the integer $(i-1)\times m + j$ (see Figure 2). We denote $A = \frac{\displaystyle A_1}{\displaystyle A_2}$ if $A$ consists of an upper block $A_1$ and a lower block $A_2$, or $A = A_3 | A_4$ if $A$ consists of a left block $A_3$ and a right block $A_4$. For an $n\times m$ array $A = (a_{ij})$, the {\it reversal} of $A$ is the $n\times m$ array $(b_{ij})$ such that $b_{ij} = a_{n-i+1, m-j+1}$ for every $(i,j)$. We denote it by $R(A)$ (see Figure 3). The transformation $\frac{\displaystyle A_1}{\displaystyle A_2} \Rightarrow \frac{\displaystyle R(A_1)}{\displaystyle A_2}$ is called a horizontal prefix reversal. The transformation $A_3 | A_4 \Rightarrow R(A_3) | A_4$ is called a vertical prefix reversal.
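The block reversal $R$ and the two prefix reversal operations defined above translate directly into a short sketch (function names are illustrative; arrays are lists of rows, 0-indexed):

```python
def R(block):
    """Reversal of an n x m block: b[i][j] = a[n-1-i][m-1-j]."""
    return [row[::-1] for row in reversed(block)]

def horizontal_prefix_reversal(A, k):
    """Replace the upper block of the first k rows by its reversal."""
    return R(A[:k]) + A[k:]

def vertical_prefix_reversal(A, k):
    """Replace the left block of the first k columns by its reversal."""
    left = R([row[:k] for row in A])
    return [left[i] + A[i][k:] for i in range(len(A))]

E = [[1, 2, 3], [4, 5, 6]]  # the standard array E_{2x3}
print(horizontal_prefix_reversal(E, 1))  # [[3, 2, 1], [4, 5, 6]]
print(vertical_prefix_reversal(E, 2))    # [[5, 4, 3], [2, 1, 6]]
```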
Unlike in the pancake sorting problem, it is not clear whether an array can always be rearranged to the standard array. A necessary and sufficient condition for an array to be rearrangeable was obtained in [7]: a rearrangement is always possible if and only if either the number of rows or the number of columns is not divisible by 4. Moreover, only even permutations can be realized in the case that the numbers of both rows and columns are divisible by 4. We shall extend the rearrangement problem of two dimensional arrays in two directions.
Another variant of the pancake sorting problem is called the burnt pancake sorting problem and has been studied by many authors (e.g. [3], [6]). It is to sort a given list of signed integers into an ascending order of positive integers. Instead of signed integers, we use upper and lower case letters (or blue and red letters). In this problem, the cases of the letters are altered whenever we apply a prefix reversal.
\begin{equation*} \begin{pmatrix} e & d & A & c & b \end{pmatrix} \Rightarrow \begin{pmatrix} B & C & a & D & E \end{pmatrix} \Rightarrow \begin{pmatrix} c & b & a & D & E \end{pmatrix} \Rightarrow \begin{pmatrix} A & B & C & D & E \end{pmatrix} \end{equation*}
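The displayed sorting sequence can be replayed with a small sketch in which, as in the text, the case of each letter flips whenever it is included in a prefix reversal (the function name is an illustrative choice):

```python
def burnt_prefix_reversal(s, k):
    """Reverse the first k letters of the list s and flip each one's case."""
    return [c.swapcase() for c in reversed(s[:k])] + s[k:]

w = list("edAcb")
for k in (5, 2, 3):          # the three reversals shown above
    w = burnt_prefix_reversal(w, k)
print("".join(w))  # ABCDE
```

The intermediate words are exactly those in the display: BCaDE after the first reversal and cbaDE after the second.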
We shall generalize the burnt pancake problem to a rearrangement problem of two dimensional arrays filled with upper and lower case letters by prefix reversals. It is clear that the problem for linear strings ($1 \times n$ arrays) is equivalent to the original burnt pancake sorting problem. Although we have an algorithm to sort a linear string, the rearrangement problem for two dimensional arrays filled with upper and lower case letters is not trivial. The array below can be rearranged to the standard array, where the standard array of size $n \times m$ is the array in which every cell is filled with an upper case letter; it is denoted by $E_{n\times m}$. We suppose $U$ is a set of upper case letters $\{ A,B,C,D,E,F\}$, $L$ is a set of lower case letters $\{ a,b,c,d,e,f\}$ and $c$ is a one-to-one correspondence between same letters with the opposite cases.
\begin{equation*} \begin{bmatrix} c & f & e \\ b & D & A \end{bmatrix} \Rightarrow \begin{bmatrix} E & F & C \\ b & D & A \end{bmatrix} \Rightarrow \begin{bmatrix} d & B & C \\ f & e & A \end{bmatrix}\Rightarrow \begin{bmatrix} a & E & F \\ c & b & D \end{bmatrix} \Rightarrow \end{equation*} \begin{equation*} \begin{bmatrix} f & e & A \\ c & b & D \end{bmatrix} \Rightarrow \begin{bmatrix} d & B & C \\ a & E & F \end{bmatrix} \Rightarrow \begin{bmatrix} A & B & C \\ D & E & F \end{bmatrix} \end{equation*}
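The six-step rearrangement displayed above can be replayed by combining the block reversal with a case flip; the helper names are illustrative choices:

```python
def R_case(block):
    """Block reversal that also flips the case of every entry."""
    return [[c.swapcase() for c in row[::-1]] for row in reversed(block)]

def hpr(A, k):
    """Horizontal prefix reversal (with case flip) of the first k rows."""
    return R_case(A[:k]) + A[k:]

def vpr(A, k):
    """Vertical prefix reversal (with case flip) of the first k columns."""
    left = R_case([row[:k] for row in A])
    return [left[i] + A[i][k:] for i in range(len(A))]

A = [list("cfe"), list("bDA")]
A = hpr(A, 1)   # -> [E F C / b D A]
A = vpr(A, 2)   # -> [d B C / f e A]
A = hpr(A, 2)   # -> [a E F / c b D]
A = hpr(A, 1)   # -> [f e A / c b D]
A = hpr(A, 2)   # -> [d B C / a E F]
A = vpr(A, 1)   # -> [A B C / D E F]
print(A)
```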
Furthermore, we extend the rearrangement problem to a rearrangement problem of three dimensional arrays using the result for a rearrangement problem of two dimensional arrays filled with upper and lower case letters by prefix reversals. We discuss rearrangability of three dimensional arrays by prefix reversals.
This is joint work with Riki Kase and Szilárd Zsolt Fazekas.