Zipf's law

To a fairly good approximation the n^{th} most common word in a large sample of English text occurs with frequency proportional to 1/n, as illustrated in the first picture below. This fact was first noticed around the end of the 1800s, and was attributed in 1949 by George Zipf to a general, though vague, Principle of Least Effort for human behavior. I suspect that in fact the law has a rather simple probabilistic origin. Consider generating a long piece of text by picking each successive character at random from k letters and a space. Now collect and rank all the "words" delimited by spaces that are formed. When k = 1, the n^{th} most common word will have a frequency that falls off exponentially with n, as c^{-n} for some constant c > 1. But when k ≥ 2, it turns out that the n^{th} most common word will have a frequency that approximates c/n. If all k letters have equal probabilities, there will be many words with equal frequency, so the distribution will contain steps, as in the second picture below. If the k letters have non-commensurate probabilities, then a smooth distribution is obtained, as in the third picture. In the case where all letter probabilities are equal, words will simply be ranked by length, with all k^{m} words of length m occurring with frequency p^{m}. Normalization of probabilities requires the sum of k^{m} p^{m} over all m ≥ 1 to equal 1, which implies k p = 1/2 and so p = 1/(2k). The word at rank n of roughly k^{m} then has probability 1/(2k)^{m} = n^{-(1 + log 2/log k)}, which is close to 1/n when k is large, so Zipf's law follows approximately.
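
The random-text experiment described above can be tried out with a short program. What follows is only a minimal sketch in Python, not the procedure used to make the pictures: the function names, the fixed seed, the decision to discard empty words produced by consecutive spaces, and the use of a least-squares fit on a log-log scale are all assumptions made for illustration.

```python
import math
import random
import string
from collections import Counter


def ranked_words(k, n_chars, letter_weights=None, space_weight=1.0, seed=0):
    """Generate random text from k letters and a space, then rank the words.

    Returns a list of (word, relative frequency) pairs, most common first.
    Empty words from consecutive spaces are discarded (an assumption).
    """
    if not 1 <= k <= 26:
        raise ValueError("k must be between 1 and 26")
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    symbols = list(string.ascii_lowercase[:k]) + [" "]
    if letter_weights is None:
        letter_weights = [1.0] * k  # equal letter probabilities by default
    weights = list(letter_weights) + [space_weight]
    text = "".join(rng.choices(symbols, weights=weights, k=n_chars))
    counts = Counter(text.split())  # split() with no argument drops empty words
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common()]


def log_log_slope(ranked, max_rank):
    """Least-squares slope of log(frequency) against log(rank) for ranks 1..max_rank."""
    freqs = [freq for _, freq in ranked[:max_rank]]
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # Equal letter probabilities: the ranked frequencies fall in steps.
    stepped = ranked_words(k=4, n_chars=200_000)
    print("equal weights, slope:", log_log_slope(stepped, 200))

    # Unequal (illustrative) letter weights: the steps are smoothed out.
    smooth = ranked_words(k=4, n_chars=200_000,
                          letter_weights=[1.0, math.sqrt(2), math.sqrt(3), math.pi])
    print("unequal weights, slope:", log_log_slope(smooth, 200))
```

A slope in the neighborhood of -1 on the log-log scale corresponds to a frequency of roughly c/n. The fitted value depends on the range of ranks used, particularly in the stepped case, and on the length of the text, so it should be read as a rough check rather than a measurement of the exponent.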