Published: 17th August 2011 DOI: 10.4204/EPTCS.63 ISSN: 2075-2180 |
This volume contains proceedings of the Eighth International Conference WORDS 2011. It was held in Prague, Czech Republic, from 12th to 16th September 2011 as a joint undertaking of the Czech Technical University and the Charles University.
The conference is a biannual meeting devoted to the mathematical theory of words, i.e., finite or infinite sequences of symbols over a finite alphabet. Since the first meeting in 1997 in Rouen, France, WORDS became the main event in the field. Words arise as a natural object in many areas and therefore the conference is also open to applications, be it in computer science, biology, linguistics or physics, while the emphasis remains on combinatorial, algebraic and algorithmic points of view.
Program committee selected 27 contributions out of 40 submissions to be presented at the conference. Each paper was reviewed by two referees. The reader will find here revised versions of the accepted contributions as well as summaries of six invited lectures.
We thank authors for their interest in WORDS, program committee members and other reviewers (listed below) for their effort, and the EPTCS editors, in particular Wolfgang Thomas and the Editor-in-Chief Rob van Glabbeek, for accepting the publication of the proceedings and for their helpful approach.
The event was financially supported by the Faculty of Nuclear Sciences and Physical Engineering of the Czech Technical University, by the Faculty of Mathematics and Physics of the Charles University, by the Doppler Institute for Mathematical Physics and Applied Mathematics, grant LC06002, and by the Czech Science Foundation, grant GAČR 201/09/0584.
There are two attributes by which the Prague Linguistic School is generally characterized: 'structural' and 'functional'. While 'structural' is a common denominator of several linguistic trends that originated in the first decades of the 20th century following Ferdinard de Saussure's pioneering linguistic approach, the term 'functional' was used by de Saussure only quite occasionally and it is supposed to be a distinctive feature of the writings of the Prague scholars, who at the same time as they recognized the necessity to describe and explain the collection of language phenomena as a structured whole rather than as a mechanical agglomeration, they emphasized that this structured whole — language — should be understood as a functioning means of communication.
In our contribution, we would like to characterize the Prague functionalism by pointing out some basic features of this approach as they are reflected in the Prague School writings both from the pre-war period as well as in the analyses of those linguists belonging to what may be called "second generation" of the Prague School. We focus on three issues which we consider to be of a more general interest:
(i) The notion of function in limguistic studies: The difficulty of an adequate characterization of the use of the term function and its derivatives consists in the fact that the very term function is homonymous and its use varies also in different scientific domains (Nagel 1961 distinguishes six various meanings of the term function in modern science). The ambiguity of this term can be demonstrated also on the different understanding of function by the Prague School. It was mainly Roman Jakobson who treated the concept of function in linguistics in the general theoretical framework of finalism or teleology. In his understanding, the claim "A phenomenon x is a means for the realization of an end F" is equivalent to the claim "A phenomenon x has a function f", that is to say that to have a function f is equivalent to "to serve as a means for the end (purpose) F". The teleological approach was reflected by Jakobson in his principle of 'therapeutic changes' in the development of language: language system always tends to a certain balance and the distortion of this balance initiates changes — which removes this insufficiency — but evokes imbalance in some other parts of the system and the process of therapeutic changes continues ad infinitum.
(ii) The articulation of the form — function relation into levels: The need for a systematic and integrated description of the relation of functions and forms has led to conceive the core of language system as consisting of levels: the units of the lower level may be looked upon as representations of the units levels can be understood as representations (forms) of the units of the adjacent higher levels, which in turn are viewed as functions of the adjacent lower levels units. From the methodological standpoint, there are two possible ways how to proceed: either to adopt the speaker's point of view and to proceed from function to form; i.e., from common needs of communication, or to take an opposite point of view and to proceed from form to function. A far-reaching significance for the understanding of the relations between units of adjacent levels is the notion of asymmetrical dualism introduced by S. Karcevskij (1929). The main idea consists in the recognition that the sign and its meaning (or rather function; Karcevskij uses the French term signification) do not cover in all their points the same field: the same sign has several functions (e.g. the morpheme -a in Czech may have the function of Nominative singular of feminine nouns, as in \v{z}en-a, or Nominative plural of neuter nouns as in m\v{e}st-a, and so on), and the same function can be expressed by several signs (e.g. the function of Nominative singular of feminine nouns can be expressed by the morpheme -a, or -e, or zero). There is always a certain tension between signifiant and signifi\'e and the asymmetrical dualism of the structure of the sign makes it possible for language to develop.
(iii) The communicative role of language: Language is regarded as a functioning system, adapted to its communicative role, diversified in more or less different social and local varieties, open to possible "variation", to change; this is reflected in the system itself: the system has a stable (solid central) core and peripheral domains (irregularities, obsolete or recent marginal phenomena) need not be in complete accordance with the laws and tendencies governing the central core. One of the important contributions of Prague scholars is their systematic attention paid to the information structure of sentences, i.e., in brief, to the distinction between the topic of the sentence("what is the sentence 'about'") and its focus ("what the sentence says about the topic"). Praguian scholars were the first to emphasize that this distinction has a semantic relevance.
In the full version of the paper, the above mentioned aspects will be discussed both from the historical perspective and the present day tenets of modern Prague theoretical linguistics and they will be illustrated by rich empirical material.
A pattern is a finite word of variables. A word w avoids pattern P if for any substitution φ of the variables of P with non-empty words, φ(P) is not a factor of w. Let Σ_{i} denote the i-letter alphabet {0,1,…,i−1}. The avoidability index λ(P) of P is the smallest integer k such that there exists an infinite word w over Σ_{k} avoiding P.
An infinite word w is said to be HDOL if it is the image by a morphism h of the fixed point of a morphism m, that is, w=h(m^{ω}(0)). Cassaigne conjectures that for every avoidable pattern P, there exists an HDOL word over Σ_{λ(P)} avoiding P.
Lower bounds on λ(P) are quite easy to obtain by backtracking, and Cassaigne's conjecture is supported by the fact that most of the proofs of upper bounds consist in finding a suitable HDOL word over Σ_{λ(P)} and showing that it avoids P.
For a given pattern, there can exist many HDOL words. For example in the case of squares, we have that λ(AA)=3 and the fixed point m^{ω}(0) of any ternary square-free morphism m avoids squares. On the other hand, it is known that λ(ABWACXBAYCAZCB)=4 and essentially the only infinite word over Σ_{4} avoiding ABWACXBAYCAZCB is the fixed point of 0→01, 1→21, 2→03, 3→23.
Thue proved that ternary square-free words avoiding 010 and 212 that are bi-prolongable are exactly the factors of the fixed point t of 0→012, 1→02, 2→1. More recently, we proved that the only infinite binary word avoiding AABBCABBA and the factors 0011 and 1100 is essentially the image of t by the morphism
0→ | 0010110111011101001, |
1→ | 00101101101001, |
2→ | 00010. |
In this talk, we consider the possibility that for every avoidable pattern P, there exists a finite set S of forbidden patterns and factors, containg P, such that the words over Σ_{λ(P)} are essentially the factors of an HDOL word. This is a strong version of Cassaigne's conjecture. I will give many examples of such HDOL words characterized by forbidden patterns and factors, as well as related open problems. We will also discuss the factor complexity of words avoiding patterns.
In this paper, we study an algebraic and combinatorial problem concerning bounded context-free languages. A strictly related notion is that of sparse language: a language L is termed sparse if its counting function, that is, the function f_{L} that maps every integer n≥0 into the number f_{L}(n) of words of L of length n, is polynomially upper bounded. Sparse languages play a meaningful role both in Computer Science and in Mathematics and have been widely investigated in the past [3, 4, 8, 10, 11, 12, 13, 15, 17]. The interest in this class of languages is due to the fact that, in the context-free case, it coincides with the one of bounded languages [11, 15]. A language L is termed bounded if there exist n words u_{1}, …, u_{n} such that L ⊆ u_{1}^{*}⋯ u_{n}^{*}.
The starting point of this investigation is the following result proved in [3]: for every sparse context-free language L_{1}, there exists a regular language L_{2} such that f_{L1} = f_{L2}. Therefore, the counting function of a sparse context-free language is always rational. This result is remarkable since, as it is well known [7], the counting function of a context-free language may be transcendental.
A conceptual limit of the above mentioned construction is that the regular language L_{2} is on a different alphabet from the one of L_{1}. Therefore it is natural to ask whether this limitation can be removed. Actually an even more general question can be asked. For this purpose, some preliminary notions are needed. Let A = {a_{1}, …, a_{t}} be an alphabet of t letters and let ψ: A^{*}→ N^{t} be the corresponding Parikh morphism. Given two languages L_{1} and L_{2} over the alphabet A, we say that L_{1} is commutatively equivalent to L_{2} if there exists a bijection f: L_{1}→ L_{2} from L_{1} onto L_{2} such that, for every u∈ L_{1}, ψ(u) = ψ(f(u)). This notion is very important in the theory of variable-length codes since it is involved in the celebrated Schützenberger conjecture about the commutative equivalence of a maximal finite code with a prefix one (see e.g. [1]).
Now the general problem above can be formulated as follows: given a bounded context-free language L_{1}, does there exist a regular language L_{2} which is commutatively equivalent to L_{1}?
In the sequel, for short, we refer to it as CE (Commutative Equivalence) Problem. The CE Problem was conjectured to have an affirmative solution by Stefano Varricchio and told to the authors during their past collaboration. In this paper, we prove the conjecture. More precisely, we prove the following statement.
Theorem 1 Every bounded context-free language L_{1} is commutatively equivalent to a regular language L_{2}. Moreover the language L_{2} can be effectively constructed starting from an effective presentation of L_{1}.
Observe that the CE problem can be seen as a kind of counting problem in the class of bounded context-free languages. Despite the fact that such class has been widely investigated in the past, CE problem appears to require some new techniques. Indeed, bounded context-free languages can be inherently ambiguous; in addition, if u_{1}^{*}⋯ u_{n}^{*}, u_{i}∈ A^{+}, is the set that contains the bounded context-free language, then, in general, u_{1}^{*}⋯ u_{n}^{*} is ambiguous as product of languages of A^{*}. Such ambiguities, which are of different nature, interfere making the study of the problem a non trivial task.
Now we would like to give a broader picture about the relationships between the main result of this paper and some classical theorems on bounded context-free languages. The first result that deserves to be mentioned is a well-known theorem by Parikh [16]. For this purpose, let us first introduce a notion. Given two languages L_{1} and L_{2} over the alphabet A, we say that L_{1} is Parikh equivalent to L_{2} if ψ(L_{1}) = ψ(L_{2}). The theorem by Parikh states that, given a context-free language L_{1}, its image ψ(L_{1}) under the Parikh map is a semi-linear set of N^{t}. As a straighforward consequence of Parikh theorem, one has that there exists a regular language L_{2} which is Parikh equivalent to L_{1}.
It is worth noticing that the property of Parikh equivalence by no means implies the property of commutative equivalence. Indeed, let A = {a, b} and let L_{1} = (ab)^{*} ∪ (ba)^{*} and L_{2} = (ab)^{*}. One has that ψ(L_{1}) = ψ(L_{2}) = (1,1)^{⊕} (the symbol ⊕ denotes the Kleene star operation in the monoid N^{2}) so that L_{1} is Parikh equivalent to L_{2}. On the other hand, one immediately checks that L_{1} cannot be commutatively equivalent to L_{2}.
Another theorem that is central in this context has been proved by Ginsburg (see [8], Theorem 5.4.2). For this purpose, let us first introduce a notion. Let L⊆ u_{1}^{*}⋯ u_{n}^{*} be a bounded language where, for every i=1, …, n, u_{i} is a word over the alphabet A. Let φ: N^{n} → u_{1}^{*}⋯ u_{n}^{*} be the map defined as: for every tuple (l_{1}, …, l_{n}) ∈ N^{n}, φ(l_{1}, …, l_{n}) = u_{1}^{l1}⋯u_{n}^{ln}. Ginsburg proved that L is context-free if and only if φ^{−1}(L) is a finite union of linear sets, each having a stratified set of periods. Roughly speaking, a stratified set of periods corresponds to a system of well-formed parentheses. However, Ginsburg theorem is of no help to study counting problems and, in particular, our problem, because of the ambiguity of the representation of such languages. Indeed, let A = {a, b,c} be a three letter-alphabet and let the language L = {a^{i}b^{j}c^{k} | i,j,k ∈ N, i=j or j=k } [2]. Since L is inherently ambiguous, by [8] Theorem 6.2.1, L cannot be represented unambiguously as a finite disjoint union of a stratified set of periods. In this context, another important recent result that gives a characterization of bounded context-free languages in terms of finite unions of Dyck loops has been proven in [12]. However, neither this latter result can be used to deal with our problem because of the ambiguity of the representation of such languages as a finite union of Dyck loops.
Due to the space limitation, we are not able to report our proof of Theorem 1. Therefore, we limit ourselves to give an idea of the main techniques involved in the proof. A basic tool that has been used to study our problem is the celebrated theorem by Eilenberg and Schützenberger that provides an algebraic characterization of semi-linear sets in terms of semi-simple sets ([6,14). The proof is essentially divided in two steps. The first deals with the case where the bounded context-free language is described, via the Ginsburg map φ, by a linear set. In this case, the proof makes use of a technique of combinatorics on words concerning variable-length codes and of a new technique (inspired to our work [5]) of geometrical nature for the decomposition of linear sets. The second step deals with the general case, that is, when the language is described, via the Ginsburg map φ, by a semi-linear set. In this step, together with the mathematical tools mentioned above, some arguments of elementary number theoretical nature are used to get our main theorem (see e.g. [9]).