Rememberall: CLI Document Retrieval using Bayesian Inference
A long-standing tradition in scientific research is to keep detailed notes on everything as it happens. This studious attention to detail not only makes analysis and paper writing much easier, but also serves as a record of exactly how an experiment was performed should it need to be repeated in the future. By looking through a lab notebook, an experiment can be repeated exactly, and results can be verified.
Although I have since moved away from the laboratory, I still keep detailed records of my work in a series of markdown files detailing the steps taken as I perform data analyses and develop software. Over the last few months, however, I’ve found that the volume of my notes has grown too large to simply `grep` for keywords.
To make it easier for me to find project- or task-specific development notes, I developed a Rust-based CLI tool called Rememberall, which uses term frequency and Bayesian inference to retrieve documents relevant to a set of query keywords.
Rememberall has two main functions: indexing a corpus of documents to identify them and tabulate term counts, and providing an interface for querying the corpus for documents by keyword. To generate the index, it first loads a corpus of documents into memory, splits sentences into single words, and stems each word into a root token. These tokens are then counted relative to both the document and the corpus. The resulting (document, token, count) records are then stored in an index file for later use.
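To make the indexing step concrete, here is a minimal sketch of the per-document and per-corpus token counting described above. It is not Rememberall’s actual code: the `Index` struct, the `index_document` function, and the use of the rust-stemmers crate for stemming are illustrative assumptions.

```rust
use std::collections::HashMap;

// A hypothetical in-memory index; Rememberall's real index format may differ.
struct Index {
    doc_counts: HashMap<String, HashMap<String, u64>>, // token -> (document id -> count)
    corpus_counts: HashMap<String, u64>,                // token -> count across the corpus
    total_tokens: u64,                                  // total tokens seen in the corpus
}

fn index_document(index: &mut Index, doc_id: &str, text: &str, stemmer: &rust_stemmers::Stemmer) {
    for word in text.split_whitespace() {
        // Lowercase and strip punctuation before stemming.
        let cleaned: String = word
            .to_lowercase()
            .chars()
            .filter(|c| c.is_alphanumeric())
            .collect();
        if cleaned.is_empty() {
            continue;
        }
        // Stem the word into its root token (e.g. "deployed" -> "deploy").
        let token = stemmer.stem(&cleaned).to_string();
        *index
            .doc_counts
            .entry(token.clone())
            .or_default()
            .entry(doc_id.to_string())
            .or_insert(0) += 1;
        *index.corpus_counts.entry(token).or_insert(0) += 1;
        index.total_tokens += 1;
    }
}

fn main() {
    let stemmer = rust_stemmers::Stemmer::create(rust_stemmers::Algorithm::English);
    let mut index = Index {
        doc_counts: HashMap::new(),
        corpus_counts: HashMap::new(),
        total_tokens: 0,
    };
    // Hypothetical document id and contents, purely for illustration.
    index_document(&mut index, "notes/deploy.md", "Deployed the cataloger to production", &stemmer);
}
```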
Once each document has been indexed, we can perform queries. When the user performs a query, such as `rememberall search cataloger deployed`, each keyword (“cataloger” and “deployed”) is stemmed, and the keyword tokens are used to load all possibly related documents from the index. This is where we start using Bayesian inference. Bayes’ theorem in its most general form can be written as
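$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where $P(A \mid B)$ is the posterior probability of $A$ given $B$, $P(B \mid A)$ is the likelihood, $P(A)$ is the prior, and $P(B)$ is the normalizing constant.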
In our case, we are provided with some keywords and we want to find the documents relevant to those keywords. With this in mind, we can adapt Bayes’ theorem to yield the probability that a document is relevant as follows:
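$$P(\text{relevant} \mid \text{keyword}) = \frac{P(\text{keyword} \mid \text{relevant})\,P(\text{relevant})}{P(\text{keyword})}$$

where the likelihood $P(\text{keyword} \mid \text{relevant})$ is estimated from the keyword’s frequency within the document, the prior $P(\text{relevant})$ is the prior probability assigned to the document, and the evidence $P(\text{keyword})$ is the keyword’s frequency across the entire corpus.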
There are a few considerations with this approach, however, that require a small amount of additional computation. First, we will be performing this calculation for every document in the corpus that contains the keywords, which means we need the counts for all occurrences of every keyword. Second, the normalizing constant is based on the probability of the keyword occurring in any document in the corpus, so we must use the entire index to compute it.
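As a purely illustrative example with made-up numbers: if the stemmed token “deploy” occurred 40 times across a corpus containing 10,000 tokens in total, the normalizing constant for that keyword would be 40 / 10,000 = 0.004.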
Rememberall implements this type of search by loading the index of tokens, finding all occurrences of each keyword in the corpus, and dividing that count by the total number of tokens in the corpus, which gives the normalizing constant for the term. Then, for each document, the frequency of each keyword is calculated (the likelihood), and the posterior probability is calculated for that specific document/keyword pair. Once all of the posterior probabilities have been calculated for a document, we assume the keywords are independent and multiply the posteriors to find the probability that the document is relevant given that all of the keywords were observed.
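To make the scoring step concrete, here is a minimal sketch, again not Rememberall’s actual code: `score_document`, its parameters, and the uniform prior are illustrative assumptions. It computes one posterior per keyword via Bayes’ theorem and multiplies the results under the independence assumption described above.

```rust
/// Illustrative scoring sketch (not Rememberall's real implementation).
/// For each keyword we are given its count inside this document and its
/// count across the whole corpus.
fn score_document(
    keyword_counts: &[(u64, u64)], // (count in document, count in corpus) per keyword
    doc_tokens: u64,               // total tokens in this document
    corpus_tokens: u64,            // total tokens in the corpus
    prior: f64,                    // prior P(relevant), e.g. uniform over the corpus
) -> f64 {
    keyword_counts
        .iter()
        .map(|&(in_doc, in_corpus)| {
            let likelihood = in_doc as f64 / doc_tokens as f64;     // P(keyword | relevant)
            let evidence = in_corpus as f64 / corpus_tokens as f64; // P(keyword) over the corpus
            likelihood * prior / evidence                           // posterior for this keyword
        })
        // Assume the per-keyword posteriors are independent and multiply them.
        .product()
}

fn main() {
    // Hypothetical numbers: "catalog" appears 3 times in a 200-token document
    // and 40 times in a 10,000-token corpus; "deploy" appears 2 and 25 times.
    // The prior assumes a corpus of 50 documents, weighted uniformly.
    let score = score_document(&[(3, 40), (2, 25)], 200, 10_000, 1.0 / 50.0);
    println!("relevance score: {score}");
}
```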
With everything put together, I can find relevant documents from the command line quickly and efficiently using nothing more than basic maths. Here is a quick video showing usage and output.
If this project interests you, please check out the rememberall git repository and contribute!