NKS 2006 Conference & Minicourse: Cellular Automata for Information Retrieval

Cellular Automata for Information Retrieval

Ed Meier

LexisNexis

Cellular automata (CA) have not yet been applied to the field of information retrieval (IR), finding documents in an unindexed corpus based on a small set of query words. This paper presents an initial study of the problem and identifies areas of research to be explored.

Conventional IR relies on statistical or probabilistic techniques, math. Cellular automata, as presented by Stephen Wolfram in A New Kind of Science, offer an entirely new approach. A document can be represented by a series of bits. Another bit string can be generated to represent a query. How much these two overlap identifies the quality of the match between a query and a document. No real math is involved.

This paper shows how a representation of documents, a corpus, might be created that can be queried upon. The search space of the documents could be a stack of bit strings, one representing each document. The bit strings that represent each document are generated by NKS-style cellular automata. The bit string to represent the query words is also generated through CA and compared to the document bit strings.

The methodology explored was whether CA provided any re-ordering of words in the CA output that would provide a signature step that could be useful in searching for a document. Cellular automata output is seemingly random, but the output appears random simply because we do not understand it yet. Someone could show algebra, calculus, and statistical functions to a fifth grade student and tell the student that they all just generate random numbers. The fifth-grader would have no reference but to believe. The same is true with CA today. We have functions that generate output that we do not understand yet. It is possible that one of these CA rules, at one of its output steps, generates a signature line for a document that will provide perfect recall and precision in a search of unindexed documents.

To accomplish this a search is conducted for the best rule, from the set of random generating ones. Each rule is scored based on an overall score of how well it works compared to other rules using a standard query for the same document.

A search is also conducted for the best step to represent a document. As we move farther from the initial step we evaluate whether or not the scores for the document searches are getting better or worse. We also search the CA output space around where all bits have had a chance to interact. Each bit in this algorithm represents a word in the document. Where they all interact is intuitively the best step to represent a document. It might be a good “signature step” for documents.

Created by Mathematica (May 11, 2006)