Traces of the Mouth
Andrei Andreyevich Markov's Mathematization of Writing
David Link

This article discusses in detail two works by the Russian mathematician Andrei Andreyevich Markov Sr (1856–1922). They represent an early and momentous attempt to understand the phenomenon of language in mathematical terms. Outside of the strictly mathematical field, Markov's achievements are only very rarely discussed.

In these works, he counted the frequency of vowels and consonants in Pushkin's Eugene Onegin and in one other text, and analysed the results with the mathematical tools of the probability theory of his time. In what follows, I first give a brief account of the role that letters had played in probability theory up to that point. The understanding of language underlying these concepts was so weak that it did not allow even very simple problems to be solved. I then describe Markov's analysis in detail.

From 1906 onwards, Markov had extended, on a purely theoretical basis, certain concepts of probability theory that were considered to apply only to independent trials to the field of dependent variables. In Pushkin's text, Markov found for the first time material with which to verify his assumptions empirically. Since this was his primary interest, he made no further comment on the meaning of his findings.
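The dependence between successive letters can be illustrated with a minimal sketch in Python. It estimates how often a vowel follows a vowel and how often a vowel follows a consonant; if the letters were independent, both shares would lie close to the overall share of vowels. The function name, the Latin vowel set and the sample word are assumptions made for illustration only and are not taken from Markov's text or from the programme linked on this page.

```python
def vowel_transition_estimates(text, vowels="aeiou"):
    """Estimate vowel shares overall and conditional on the previous letter.

    Assumption: only alphabetic characters count as letters, and the
    vowel set is the Latin one; a Russian text would need its own set.
    """
    # True for a vowel, False for a consonant, in reading order.
    flags = [c in vowels for c in text.lower() if c.isalpha()]

    # Split each letter by the class of the letter preceding it.
    pairs = list(zip(flags, flags[1:]))
    after_vowel = [cur for prev, cur in pairs if prev]
    after_consonant = [cur for prev, cur in pairs if not prev]

    def share(values):
        # Share of vowels in a list of flags; NaN if the list is empty.
        return sum(values) / len(values) if values else float("nan")

    return {
        "p_vowel": share(flags),
        "p_vowel_after_vowel": share(after_vowel),
        "p_vowel_after_consonant": share(after_consonant),
    }


if __name__ == "__main__":
    print(vowel_transition_estimates("banana"))
```

A large gap between the two conditional shares indicates that the trials are dependent, which is the situation the theory of connected events was designed for.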

The first astonishing result was that the distribution of vowels and consonants followed a 'normal' distribution. Although Markov did not say as much, this means that at the source of language lies a random process. I attempt to find an explanation for this in the lectures of the Swiss linguist Ferdinand de Saussure, which he gave at approximately the same time and which offer a helpful theory on the collective genesis of language. Even though it is probable that Markov and Saussure were unacquainted with each other's work, they shared a strong interest in formalization and an approach that is differential rather than substantial. By applying Markov's analysis to randomly selected words, I demonstrate that it is not the individual style of an author that produced the observed randomness but the fact, as stated by Saussure, that language is formed in an unconscious, collective process. Certain individuals begin to speak differently, and their changes to the language may or may not be accepted by others.

Markov's second result was that the dispersion of this random distribution is much smaller than would be expected for independent trials. Again, Markov applied the theoretical formulae of his earlier work only to verify their validity, and did not enquire as to the reason for this phenomenon. Saussure's theory provides an explanation: the few individuals who start to speak differently are subject to the physical constraints imposed by the mouth and thus cannot recombine letters completely at random. Therefore, Markov's method also determines the degree to which written text represents orality, and this allows a much firmer grip on language than probability theory had achieved before Markov.
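The comparison of dispersions can be sketched as follows in Python. The sketch counts the vowels in consecutive blocks of letters and divides the observed variance of these counts by the variance that independent trials with the same vowel share would produce. It is a simplified illustration and not a reproduction of Markov's formulae; the block size, the Latin vowel set and all function names are assumptions.

```python
VOWELS = set("aeiou")  # assumption: Latin vowels, for illustration only


def vowel_counts_per_block(text, block_size):
    """Count the vowels in consecutive blocks of `block_size` letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    n_blocks = len(letters) // block_size  # an incomplete last block is dropped
    return [
        sum(1 for c in letters[i * block_size:(i + 1) * block_size] if c in VOWELS)
        for i in range(n_blocks)
    ]


def dispersion_ratio(counts, block_size):
    """Observed variance of the counts divided by the binomial variance.

    A value near 1 is what independent trials would give; a value well
    below 1 means the counts vary less than independence predicts.
    Note: the ratio is undefined if the text has no vowels or only vowels.
    """
    n = len(counts)
    mean = sum(counts) / n
    p = mean / block_size  # estimated share of vowels
    observed = sum((c - mean) ** 2 for c in counts) / n
    expected = block_size * p * (1 - p)  # variance under independence
    return observed / expected


if __name__ == "__main__":
    sample = "ba" * 50  # strictly alternating consonant and vowel
    counts = vowel_counts_per_block(sample, 10)
    print(counts, dispersion_ratio(counts, 10))
```

In the strictly alternating sample every block contains the same number of vowels, so the ratio is zero; this is the extreme case of a text in which constraints on letter sequences remove all variation.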

To generate a completely random text as a comparison and to destroy any dependence between the letters, Markov wrote the text row by row into a table and read out the columns vertically. I show that this technique was inspired by the cryptography of his time and that even Pushkin, the author studied by Markov, used this technique to encrypt the politically dangerous tenth chapter of Eugene Onegin.
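The table technique described above can be sketched in a few lines of Python: the text is written row by row into a table of fixed width and then read out column by column. The table width and the function name are assumptions chosen for illustration; the dimensions of Markov's own tables are not given here.

```python
def read_columns(text, width):
    """Write `text` row by row into a table `width` letters wide,
    then read the table out column by column, top to bottom."""
    rows = [text[i:i + width] for i in range(0, len(text), width)]
    return "".join(
        row[col]
        for col in range(width)
        for row in rows
        if col < len(row)  # the last row may be shorter than the others
    )


if __name__ == "__main__":
    print(read_columns("abcdefgh", 4))
```

Letters that were adjacent in the original text end up `width` positions apart in the rows of the table and are therefore separated in the column reading, which is what weakens the dependence between neighbouring letters.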

Traces of the Mouth. Andrei Andreyevich Markov's Mathematization of Writing. History of Science 44.145 (2006): 321-348.
Full Article as PDF

Source code of Russian word list programme
Programme output

Chains to the West
Markov's Theory of Connected Events and Its Transmission to Western Europe
David Link

At the beginning of the twentieth century, the Russian mathematician Andrey A. Markov extended the laws of the calculus of probability, on a purely theoretical basis, to trials that were dependent on each other. In two articles from 1913, which are presented here in English translation, he applied his theory for the first time to empirical material, namely text. After a presentation of Markov's methods, results, and possible inspirations, the introduction investigates in detail the dissemination of his ideas to Western Europe and North America. Finally, the experimental application of his method to various types of text determines its scope.

Chains to the West. Markov's Theory of Connected Events and its Transmission to Western Europe. Science in Context 19.4 (2006): 561-589.
Full Article as PDF

Classical Text in Translation: An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains. Science in Context 19.4 (2006): 591-600.
Translation of "An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains", A. A. Markov, 1913 - Full Text as PDF

Classical Text in Translation: On a Remarkable Case of Samples Connected in a Chain. Appendix on the statistical investigation of a text by Aksakov. Science in Context 19.4 (2006): 601-604.
Translation of "On a Remarkable Case of Samples Connected in a Chain", A. A. Markov, 1924 - Full Text as PDF