
Rememberall: CLI Document Retrieval using Bayesian Inference

17 Sep 2016

A long-standing tradition in scientific research is to keep detailed notes on everything as it happens. This studious attention to detail not only makes analysis and paper writing much easier, but also serves as a record of exactly how an experiment was performed should it need to be repeated in the future. By looking through a lab notebook, an experiment can be repeated exactly, and results can be verified.

lab notebook

Figure 1: A laboratory notebook used to record experiment setup, observations, ideas, data, and analysis results. Laboratory notebooks are permanent records of the events that transpired during an experiment, the experimenter's thoughts and observations along the way, and the experimental results. These records are an invaluable resource when communicating research, and are often a legally binding record of the research that was conducted.

Although I have since moved away from the laboratory, I still keep detailed records of my work in a series of markdown files detailing the steps taken as I perform data analyses and develop software. Over the last few months, however, I’ve found that the volume of my notes has grown too large to simply grep for keywords.

To make it easier for me to find project- or task-specific development notes, I developed a Rust-based CLI tool called Rememberall, which uses term frequency and Bayesian inference to retrieve documents relevant to a set of query keywords.

remembrall

Figure 2: A remembrall shown in Harry Potter and the Sorcerer's Stone. A remembrall, as created by J. K. Rowling in her book Harry Potter and the Sorcerer's Stone, is a small glass ball filled with smoke that turns red when its owner has forgotten something, and turns clear when whatever was forgotten has been remembered. The comedic downside to the remembrall is that it does not help the owner identify what has been forgotten. This is in contrast to Rememberall, which specifically helps jog the user's memory of a previous note or event. As an added benefit, Rememberall isn't missing an "e".

Rememberall has two main functions: indexing a corpus of documents to identify documents and tabulate term counts, and providing an interface for querying the corpus for documents based on keywords. To generate the index, it first loads a corpus of documents into memory, splits sentences into single words, and stems each word into a root token. These tokens are then counted relative to both the document and the corpus. This index of (document, token, count) entries is then stored in an index file for later use.
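To make the indexing step concrete, here is a minimal Rust sketch of the tokenize, stem, and count pipeline described above. The struct and function names are illustrative rather than Rememberall's actual types, and the suffix-stripping stemmer is a deliberately naive stand-in for a real stemming algorithm such as Porter's.

```rust
use std::collections::HashMap;

/// Per-document token counts plus corpus-wide totals.
/// (Struct and field names are illustrative, not Rememberall's actual types.)
#[derive(Default)]
struct Index {
    /// token -> (document path -> count of the token in that document)
    doc_counts: HashMap<String, HashMap<String, u64>>,
    /// document path -> total number of tokens in that document
    doc_lengths: HashMap<String, u64>,
    /// token -> count across the whole corpus
    corpus_counts: HashMap<String, u64>,
    /// total number of tokens seen in the corpus
    total_tokens: u64,
}

/// Deliberately naive stemmer, used only for illustration; a real
/// implementation would use a proper algorithm such as Porter stemming.
fn stem(word: &str) -> String {
    let w = word.to_lowercase();
    for suffix in ["ing", "ed", "es", "s"] {
        if let Some(stripped) = w.strip_suffix(suffix) {
            if stripped.len() > 2 {
                return stripped.to_string();
            }
        }
    }
    w
}

/// Split a document into words, stem each word into a root token, and count
/// the tokens relative to both the document and the corpus.
fn index_document(index: &mut Index, path: &str, text: &str) {
    for word in text.split_whitespace() {
        let cleaned: String = word.chars().filter(|c| c.is_alphanumeric()).collect();
        if cleaned.is_empty() {
            continue;
        }
        let token = stem(&cleaned);
        *index
            .doc_counts
            .entry(token.clone())
            .or_default()
            .entry(path.to_string())
            .or_insert(0) += 1;
        *index.doc_lengths.entry(path.to_string()).or_insert(0) += 1;
        *index.corpus_counts.entry(token).or_insert(0) += 1;
        index.total_tokens += 1;
    }
}
```

Calling `index_document` once per file builds up both the per-document counts and the corpus-wide counts that the search step relies on.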

index-process

Figure 3: The document indexing process that Rememberall uses. Rememberall loads each document; formats, cleans, and stems the text; calculates token frequency; and generates metadata about the document. Then the term frequency is calculated across the entire corpus, and the resulting counts are stored in an index file and a corpus file for later use.

Once each document has been indexed, we can perform queries. When the user performs a query, such as rememberall search cataloger deployed, each keyword (“cataloger” and “deployed”) is stemmed, and the keyword tokens are used to load all possibly related documents from the index. This is where we start using Bayesian inference. Bayes’ theorem in its most general form can be written as
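$$
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
$$

where $P(A \mid B)$ is the posterior probability of $A$ given $B$, $P(B \mid A)$ is the likelihood, $P(A)$ is the prior, and $P(B)$ is the evidence acting as a normalizing constant.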

In our case, we are provided with some keywords and we want to find the documents that are relevant to those keywords. With this in mind, we can adapt Bayes’ theorem to yield the probability that a document is relevant as follows
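$$
P(D \mid k) = \frac{P(k \mid D)\,P(D)}{P(k)}
$$

where $D$ is a document and $k$ is a stemmed keyword token: $P(k \mid D)$ is the frequency of the keyword within the document (the likelihood), $P(D)$ is the prior probability of the document, and $P(k)$ is the frequency of the keyword across the entire corpus (the normalizing constant).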

This approach does, however, require a small amount of additional computation. First, we will be performing this calculation for every document in the corpus that contains the keywords, which means we need the counts for all occurrences of every keyword. Second, the normalizing constant is based on the probability of the keyword occurring in any document in the corpus, so we must use the entire index to compute it.

Rememberall implements this type of search by loading the index of tokens, finding all occurrences of each keyword in the corpus, and dividing that count by the total number of tokens in the corpus, which gives the normalizing constant for the term. Then, for each document, the frequency of each keyword is calculated (the likelihood), and the posterior probability is calculated for that specific document/keyword pair. When all of the posterior probabilities have been calculated for a document, we assume that each posterior probability is independent and multiply the posteriors together to find the probability of the document being relevant given that all of the keywords were observed.
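Here is a sketch, in Rust, of the scoring loop just described, reusing the illustrative `Index` struct (and its `HashMap` import) from the indexing sketch above. It assumes a uniform prior over documents and follows the simplification of multiplying per-keyword posteriors directly; the function name and signature are hypothetical, not Rememberall's actual API.

```rust
/// Score every document containing at least one of the (already stemmed)
/// keyword tokens: uniform prior over documents, keyword frequency within the
/// document as the likelihood, keyword frequency across the corpus as the
/// normalizing constant, and per-keyword posteriors multiplied together under
/// the independence assumption.
fn score_documents(index: &Index, keywords: &[String], n_documents: u64) -> Vec<(String, f64)> {
    let prior = 1.0 / n_documents as f64; // P(D): uniform prior over documents
    let mut scores: HashMap<String, f64> = HashMap::new();

    for token in keywords {
        // P(k): how often the token occurs anywhere in the corpus.
        let corpus_count = *index.corpus_counts.get(token).unwrap_or(&0);
        if corpus_count == 0 {
            continue; // token never seen, so it contributes no evidence
        }
        let p_k = corpus_count as f64 / index.total_tokens as f64;

        if let Some(docs) = index.doc_counts.get(token) {
            for (path, &count) in docs {
                // P(k | D): frequency of the token within this document.
                let doc_len = *index.doc_lengths.get(path).unwrap_or(&1);
                let likelihood = count as f64 / doc_len as f64;
                let posterior = likelihood * prior / p_k;
                // Multiply posteriors across keywords (independence assumption).
                *scores.entry(path.clone()).or_insert(1.0) *= posterior;
            }
        }
    }

    // Sort the documents in descending order of posterior probability.
    let mut ranked: Vec<(String, f64)> = scores.into_iter().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked
}
```

The front of the returned vector then corresponds to the top _n_ results that the query process displays, as described in Figure 4 below.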

search-process

Figure 4: The query process using Bayes' theorem. The query process loads the corpus and index files into memory to make use of the indexed token frequencies. Each keyword is then stemmed and used to find all documents in the corpus that contain it. For each document containing the keyword token, the posterior probability of the document being relevant given that keyword token is calculated. If this process has already been performed for another keyword in the search string, the existing posterior probability is multiplied by the newly calculated one. Once each keyword has been processed, the documents are sorted in descending order by posterior probability and the top _n_ results are displayed.

With everything put together, I am able to find relevant documents from the command line very quickly and efficiently using basic maths. Here is a quick video showing usage and output.



If this project interests you, please check out the rememberall git repository and contribute!