Pointer-based encoding

One can encode a list of data d by generating pointers to the longest and most recent copies of each subsequence of length at least b using

PEncode[d_, b_ : 4] := Module[{i, a, u, v}, i = 2; a = {First[d]}; While[i ≤ Length[d], {u, v} = Last[Sort[Table[{MatchLength[d, i, j], j}, {j, i - 1}]]]; If[u ≥ b, AppendTo[a, p[i - v, u]]; i += u, AppendTo[a, d〚i〛]; i++]]; a]

MatchLength[d_, i_, j_] := With[{m = Length[d] - i}, Catch[Do[If[d〚i + k〛 =!= d〚j + k〛, Throw[k]], {k, 0, m}]; m + 1]]

The process of encoding can be made considerably faster by keeping a dictionary of previously encountered subsequences. One can reproduce the original data using

PDecode[a_] := Module[{d = Flatten[a /. p[j_, r_] Table[p[j], {r}]]}, Flatten[MapIndexed[If[Head[#1] === p, d〚#2〛 = d〚#2 - First[#1]〛 ,#1] &, d]]]

To get a representation purely in terms of 0 and 1, one can use a self-delimiting representation for each integer that appears. (Knowing the explicit representation one could then determine whether each block would be shorter if encoded literally or using a pointer.) The encoded version of a purely repetitive sequence of length n has a length that grows like Log[n]. The encoded version of a purely nested sequence grows like Log[n]^{2}. The encoded version of a sufficiently random sequence grows like n (with the specific encoding used in the text, the length is about 2n). Note that any sequence of 0's and 1's corresponds to the beginning of the encoding for some sequence or another.

It is possible to construct sequences whose encoded versions grow roughly like fractional powers of n. An example is the sequence Table[Append[Table[0, {r}], 1], {r, s}] whose encoded version grows like √n Log[n]. Cyclic tag systems often seem to produce sequences whose encoded versions grow like fractional powers of n. Sequences produced by concatenation sequences are not typically compressed by pointer encoding.

With completely random input, the probability that the length b subsequence which begins at element n is a repeat of a previous subsequence is roughly 1 - (1 - 2^{-b})^{n - 1}. The overall fraction of a length n input that consists of repeats of length at least b is greater than 1 - 2^{b}/n and is essentially

1 - Sum[(1 - 2^{-b})^{i} Product[1 + (1 - 2^{-b})^{j} - (1 - 2^{-b - 1})^{j}, {j, i - b + 1, i - 1}], {i, b, n - b}]/(n - 2b + 1)