• # Rememberall: CLI Document Retrieval using Bayesian Inference

17 Sep 2016

A long-standing tradition in scientific research is to keep detailed notes on everything as it happens. This studious attention to detail not only makes analysis and paper writing much easier, but also serves as a record of exactly how an experiment was performed should it need to be repeated in the future. By looking through a lab notebook, an experiment can be repeated exactly, and results can be verified.

Figure 1: A laboratory notebook used to record experiment setup, observations, ideas, data, and analysis results. Laboratory notebooks are permanent records of the events that transpired during an experiment, an experimenter's thoughts and observations during an experiment, and the experimental results. These records are an invaluable resource when communicating research, and are often a legally binding record of research that was conducted.

Although I have since moved away from the laboratory, I still keep detailed records of my work in a series of markdown files detailing the steps taken as I perform data analyses and develop software. Over the last few months, however, I've found that the volume of my notes has grown too large to simply grep for keywords.

To make it easier for me to find project- or task-specific development notes, I developed a Rust-based CLI tool called Rememberall, which uses term frequency and Bayesian inference to retrieve documents relevant to a query of keywords.
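
As a rough illustration of the idea, here is a minimal Python sketch of ranking documents with term frequencies and Bayes' rule. It is not Rememberall's actual Rust implementation: the `tokenize` and `rank` functions, the uniform prior, and the Laplace smoothing are all assumptions made for this example.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase the text and split it into words, dropping edge punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def rank(documents, query):
    """Order document names by log P(doc) + sum of log P(term | doc).

    documents maps a name to its text. The prior is uniform, and term
    likelihoods use add-one (Laplace) smoothing so that a missing term
    does not zero out a document.
    """
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    vocabulary = set().union(*counts.values())
    log_prior = -math.log(len(documents))
    terms = tokenize(query)
    scores = {}
    for name, tf in counts.items():
        total = sum(tf.values())
        score = log_prior
        for term in terms:
            score += math.log((tf[term] + 1) / (total + len(vocabulary)))
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

notes = {
    "vpn.md": "connect to the vpn using openconnect on arch linux",
    "bees.md": "bees pollinate plants and bees collect nectar",
}
best_match = rank(notes, "vpn linux")[0]
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once the notes contain thousands of words.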

• # An N-Race Schelling Segregation Model

21 Jan 2016

It is a common theme in non-linear modeling that small perturbations in initial conditions can result in massive deviations in the outcome of a simulation. In 1969, Nobel laureate Thomas Schelling explored how even small biases can have large sociological effects. In his paper "Models of segregation", Schelling described a model in which a preference that one's neighbors be of a specific mixture can lead to total segregation regardless of intent.

Figure 1: A 500x500 Schelling segregation model with 3 races. This simulation was conducted with a maximum minority threshold of 0.2 for 1000 ticks.

Using this concept as a starting point, I developed a Java-based application to simulate a Schelling segregation model for an arbitrary number of races (n>=1).
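
To give a feel for the mechanics, here is a minimal Python sketch of one tick of an n-group Schelling model. It is not the Java application itself: the `min_similar` threshold, the wrap-around edges, and the rule that unhappy agents jump to a random empty cell are assumptions chosen for illustration.

```python
import random

EMPTY = 0

def unhappy(grid, r, c, min_similar):
    """True if the share of like neighbors around (r, c) is below min_similar."""
    size = len(grid)
    same = other = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neighbor = grid[(r + dr) % size][(c + dc) % size]  # wrap at edges
            if neighbor == EMPTY:
                continue
            if neighbor == grid[r][c]:
                same += 1
            else:
                other += 1
    occupied = same + other
    return occupied > 0 and same / occupied < min_similar

def tick(grid, min_similar, rng):
    """Move each unhappy agent to a random empty cell; return how many moved."""
    size = len(grid)
    cells = [(r, c) for r in range(size) for c in range(size)]
    empties = [(r, c) for r, c in cells if grid[r][c] == EMPTY]
    # Happiness is judged against the state at the start of the tick.
    movers = [(r, c) for r, c in cells
              if grid[r][c] != EMPTY and unhappy(grid, r, c, min_similar)]
    rng.shuffle(movers)
    moved = 0
    for r, c in movers:
        if not empties:
            break
        i = rng.randrange(len(empties))
        er, ec = empties[i]
        grid[er][ec], grid[r][c] = grid[r][c], EMPTY
        empties[i] = (r, c)  # the vacated cell is now available
        moved += 1
    return moved

def make_grid(size, groups, empty_share, rng):
    """Build a size x size grid of group labels 1..groups plus empty cells."""
    flat = [EMPTY if rng.random() < empty_share else rng.randint(1, groups)
            for _ in range(size * size)]
    return [flat[i * size:(i + 1) * size] for i in range(size)]
```

Calling `tick` repeatedly on a grid from `make_grid(50, 3, 0.1, random.Random(1))` is enough to watch clusters form, even with a fairly low `min_similar`.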

• # Estimating Epulopiscium Sp. Type B Chromosome Density Using Computer Vision

21 Apr 2015

In addition to the epidemiology presentation this morning at SUNY Geneseo's 9th Annual GREAT Day symposium, I also presented a poster with Matthew Taylor on the use of computer vision in the localization of fluorescently labeled genomes in the extremely polyploid Epulopiscium sp. Type B. As a test case for this technology, we used the coordinates for the localized chromosomes to estimate the chromosome density for cells during different life stages. For those interested, the poster is a good read.

As a part of our presentation, we presented a 3D model of chromosomes localized from a cell that was forming daughters. The model is rendered in WebGL using the three.js library with support for both mouse and Leap Motion control.

Figure 1: Live demonstration of 3D chromosome distribution generated using computer vision. This figure is a live 3D demonstration of the spatial structure of an Epulopiscium cell's chromosomes. This model can be rotated by clicking and dragging or by using a Leap Motion. No, seriously. Try it.

• # European Flight Restrictions May Inhibit International Propagation of Ebola

21 Apr 2015

Today was SUNY Geneseo's 9th Annual GREAT Day, a college-wide symposium of creativity and academic research. This morning Matthew Taylor and I presented our metapopulation network model for simulating the international spread of Ebola via aviation. Using real flight data donated by FlightAware, airport data from OpenFlights.org, and a gridded population of the world, we constructed a model consisting of over 3,000 airports, 82,000 routes, and over 4 billion individuals. Using this model, we tested the efficacy of country-based flight regulations in preventing the international spread of Ebola.

• # Traffic Analysis for the Culver Road and East Main Street Intersection

16 Apr 2015

This past weekend was UP-Stat 2015, and with it came this year's data competition. In keeping with the conference theme of "statistical modeling in the era of data science," the competition was an analysis of traffic data collected from the intersection of Culver Road and East Main Street in Rochester, NY. I decided to take a stab at analyzing the data with my friends Matthew Taylor and Tom Hartvigsen, and our analysis was presented as a finalist at the conference!

Feel free to read our report. If you are interested in our analysis or the competition data, it is available on GitHub.

• # Analyzing Geneseo's Yik Yak Community

13 Feb 2015

Due to the nature of the data analyzed this post contains strong language.

The pseudo-anonymous messaging app Yik Yak has been making waves in high schools and college communities for the last year, with officials and community members claiming that the application promotes bullying and hate speech. In March of 2014, Dr. Keith Ablow wrote an opinion piece for Fox News stating that "Yik Yak is the most dangerous app [he'd] ever seen." SUNY Geneseo, a small rural college with a campus population near 5,000 students, has a burgeoning Yik Yak community with hundreds of users. Like other college campuses, Geneseo's campus is debating the social pros and cons of the type of anonymous forum that Yik Yak provides.

Figure 1: Peter Steiner's famous cartoon published by The New Yorker on July 5, 1993. Steiner's comic showed the general transition of the Internet out of the hands of purely government and academic use into the hands of everyday individuals. It also touches on the pseudo-anonymous aspects of Internet culture providing a disconnect between individual identity and community membership.

Regardless of the rhetoric and vitriol surrounding the app and cherry-picked examples to be lambasted by the media, very few people have put their money where their mouth is in terms of the content on Yik Yak. Is Yik Yak really as socially dangerous as Dr. Ablow would have people believe, or is it simply a diverse and vocal community like other social networks?

• # Creating Cellular Automata: Life-like Cellular Automata

01 Jun 2014

In a continuation of understanding models of life, one of the most interesting cellular automata is a two-dimensional "life-like" automaton. The first life-like automaton was created by John Conway in 1970, and was published in the October 1970 issue of Scientific American. The intrigue that surrounds the automaton comes from the emergence and self-organization of highly complex patterns as the simulation evolves. As a result, these automata have attracted the interest of computer scientists, mathematicians, biologists, and physicists.

Figure 1: An example of life-like cellular automata. This simulation is represented as Unicode characters from an automaton implemented in Python. White characters represent cells that are alive. As the simulation progresses, it approaches a state of stability and order. Some patches of complexity remain and migrate through the world.

Let's examine the rules of Conway's cellular automaton and see if we can implement a simple life-like cellular automaton in Python.
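
As a preview, here is a minimal sketch of a single generation under Conway's rule (a dead cell with exactly three live neighbors is born, and a live cell with two or three live neighbors survives). Storing the world as a set of live `(row, col)` coordinates on an unbounded grid is a choice made for this sketch, not necessarily the representation used in the full post.

```python
from collections import Counter

def step(live):
    """Advance a set of live (row, col) cells by one generation (rule B3/S23)."""
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

# A "blinker": three cells in a row that flip between horizontal and vertical.
blinker = {(1, 0), (1, 1), (1, 2)}
next_generation = step(blinker)
```

Because only cells near live cells are ever counted, the cost of a step depends on the number of live cells rather than on the size of the world.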

• # Creating Cellular Automata: Elementary Cellular Automata

03 May 2014

Sometimes the best way to learn about something is to create it. If I want to learn about an interesting subset of mathematics, eigenvalues perhaps, the explanation of the math can only go so far. I must let my pencil do the talking as I learn through construction. Programming is no different. So, when I became interested in cellular automata, I decided to make some examples of different types. Today, we can go over elementary automata.

Figure 1: An example of elementary cellular automata following rule 110. Each row in the image represents a segment of time in a time series from top to bottom. Each black space represents a living organism, and each white space represents a dead organism. By defining a rule set, we can determine the outcome of the system. Rule 110 is named for the decimal equivalent of the binary series 01101110.

Elementary automata are particularly interesting for two reasons:

1. The limited number of rules and interactions make them very easy to study.
2. The visual nature of time allows for deep investigation into changing patterns.

With these two characteristics, elementary cellular automata have become a tool to explore emergence, chaos, and complexity in a non-linear system.
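
To make the rule numbering concrete, here is a minimal sketch of an elementary automaton. Each cell's next state is looked up from the bits of the rule number, using the three-cell neighborhood (left, center, right) as a 3-bit index. Treating cells beyond the edges as dead and starting from a single live cell on the right are assumptions made for this sketch.

```python
def step(cells, rule=110):
    """Apply an elementary rule once; cells beyond the edges count as dead."""
    padded = [0] + cells + [0]
    return [
        # The neighborhood forms a number 0..7 that selects one bit of the rule.
        (rule >> ((padded[i - 1] << 2) | (padded[i] << 1) | padded[i + 1])) & 1
        for i in range(1, len(padded) - 1)
    ]

def run(width, generations, rule=110):
    """Return the initial row plus one row per generation."""
    cells = [0] * (width - 1) + [1]
    rows = [cells]
    for _ in range(generations):
        cells = step(cells, rule)
        rows.append(cells)
    return rows

def render(rows):
    """Draw live cells as '#' and dead cells as spaces, one row per time step."""
    return "\n".join("".join("#" if cell else " " for cell in row) for row in rows)

print(render(run(32, 16)))
```

Swapping the `rule` argument for any value from 0 to 255 gives a different automaton with no other code changes.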

• # Connecting to Cisco IPSec VPNs on Arch Linux

26 Feb 2014

In preparation for some travel abroad to Ivrea, Italy, I decided that I needed a secure way to connect back to my server at college. SUNY Geneseo is kind enough to provide a Cisco IPSec VPN into their heavily firewalled network and, with a little work, we can VPN in without an issue.

Figure 1: A generic visualization of the principles of VPN tunneling. In this figure distinct networks are able to connect with one another via the internet whilst preserving anonymity. Image by Ludovic.ferre licensed under Creative Commons Attribution-Share Alike 3.0 Unported.

## VPNC and the OpenConnect Client

Cisco provides a proprietary VPN client for users; however, this application lacks official Linux support and remains unstable on Arch Linux. The open source community has created an alternative to the Cisco VPN client called the OpenConnect Client. Arch Linux has a package in the official repositories called openconnect. To install it, open a terminal and run

pacman -S openconnect


Once installed, we can configure and initialize a VPN instance using the vpnc command.

• # A Case for Public Data in Higher Education

23 Jan 2014

It has become commonplace for today's universities to release faculty evaluations to their students as a way to assist students in choosing classes. Recently, however, Yale decided that public access to this freely available information should not be allowed when they blocked university access to Yale Bluebook+, a student-made service to view faculty evaluations along with course registration information.[1][2] At my alma mater, SUNY Geneseo, our faculty evaluations are obscured from view as non-searchable PDF files never to be seen by students again. As shown in Figure 1, this approach has consequences.[3][4]

Figure 1: The decline of average SOFI response by semester. This chart shows a clear and startling decline of almost 40% in the last four years. I hypothesize that the response percentage for SOFIs is decreasing because the data collected has no impact on students in its current form.

When students are unable to see the results of their evaluations, participation drops. When participation drops, the data become less valuable and the cycle continues. On the flip side, if the data were to become useful again -- as a metric for students to choose future professors, perhaps -- we could restore the worth of the SOFI data. My question is, what would happen if this data were to be placed directly in the hands of the people who need it the most, exactly when they need it?

• # Agent-Based Exploration of Plant-Pollinator Mutualism

08 Dec 2013

From as early as 1869, apiarists have reported a set of symptoms in which colonies lose many adult worker bees leaving behind large stores of food, brood, and even the queen. Colony Collapse Disorder, as described above, continued at a steady incidence rate of ~17-20% in the 1990s and early 2000s. The rate of CCD started to increase, however, in November of 2006 to between 30% and 90% (an admittedly large range).

Figure 1: A European honey bee Apis mellifera extracts nectar from an Aster flower using its proboscis. Tiny hairs covering the bee's body maintain a slight electrostatic charge, causing pollen from the flower's anthers to stick to the bee, allowing for pollination when the bee moves on to another flower. Image released into the public domain by John Severns.

Bees are an important component in the pollination of plants, particularly in modern agriculture where bees are known to pollinate over 120 different species of crop. Given that pollinators, such as bees, are known to develop mutualistic relationships with particular species of plants, Matthew Taylor, Andrew Patt, and I set out to create an agent-based model to explore how obligate pollination affects the dynamics of plant competition.

• # Types of time for simulations

22 Nov 2013

For the last few weeks a couple of colleagues and I have been modeling competition in pollinating plants under ecology professor Dr. Gregg Hartvigsen. In our particular research, a 2D spatial simulation consisting of agents simulates plant and bee behavior. At face value, the model looks similar to cellular automata, but in this case the rules are slightly more complex.

Figure 1: An early run of our model's first version. This is a typical domination case, where one species outcompetes another. This version, however, is flawed in the evaluation of discrete time, in which some cells have reproduction bias.

After we started testing our model, we realized a massive mistake: we biased some plants over others when we handled time. This led to the question of whether we should adapt our model to use continuous time or discrete time, and what those two approaches would entail.
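
One common way this kind of bias creeps into a discrete-time model is updating the world in place while scanning it in a fixed order. The toy Python sketch below is only an illustration of that general pitfall, not a reconstruction of our actual bug: the one-dimensional row and the "reproduce into the empty cell on the right" rule are invented for the example.

```python
def tick_in_place(row):
    """Scan left to right, writing into the same list that is being read."""
    for i in range(len(row) - 1):
        if row[i] and not row[i + 1]:
            # The offspring is visible to the rest of this scan, so it can
            # reproduce again within the same tick.
            row[i + 1] = row[i]
    return row

def tick_synchronous(row):
    """Read only the old state and write into a fresh copy."""
    new_row = list(row)
    for i in range(len(row) - 1):
        if row[i] and not row[i + 1]:
            new_row[i + 1] = row[i]
    return new_row

in_place = tick_in_place([1, 0, 0, 0])
synchronous = tick_synchronous([1, 0, 0, 0])
```

With the in-place update, a single plant fills the whole row in one tick because each new offspring reproduces during the same scan. The synchronous version spreads one cell per tick regardless of scan order.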

• # How challenging are Geneseo's classes?

02 Nov 2013

This week fellow student and fellow data analysis enthusiast Herb Susmann released student-reported SOFI data on courses at SUNY Geneseo, welcoming people to see what interesting relationships -- or lack thereof -- they could find in the data.

To that end, I downloaded the data, fired up R, and decided to compare how challenging students rated their classes. The individual course data was too narrow a data set, so I examined the data for classes in the natural sciences, the social sciences, and the fine arts.

• # The enzyme kinetics showdown

27 Oct 2013

Last week, I was asked to review a new case study for W.H. Freeman for their 7th edition of Lehninger's Biochemistry. The study involved using enzyme kinetics to discover the identity of a poisoner using an unnamed inhibitor. Alas, I had a difficult time getting the correct maximum enzyme velocity and Michaelis constant.

Initially, I did the usual non-linear regression of the kinetics plot in R using nls, but that wasn't right. Next I tried a Hanes-Woolf plot because of its relative accuracy at finding constants. As a last-chance effort I made a Lineweaver-Burk plot which had the "correct" values.

Educationally, the Lineweaver-Burk plot is used as a mostly accurate determination of the kinetics constants while being easy to both construct and read. However, the differences in computed constants between a Lineweaver-Burk plot, a Hanes-Woolf plot, and basic non-linear regression seem non-trivial. Just how different are they?
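
For reference, both linearizations are rearrangements of the Michaelis-Menten equation v = Vmax[S] / (Km + [S]): Lineweaver-Burk fits 1/v against 1/[S], and Hanes-Woolf fits [S]/v against [S]. The sketch below is written in plain Python rather than the R used for the analysis, and the function names and sample values are made up for illustration. The non-linear regression is left out to keep it short.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def lineweaver_burk(s, v):
    """Fit 1/v = (Km/Vmax) * (1/[S]) + 1/Vmax and return (Vmax, Km)."""
    slope, intercept = linear_fit([1 / x for x in s], [1 / y for y in v])
    vmax = 1 / intercept
    return vmax, slope * vmax

def hanes_woolf(s, v):
    """Fit [S]/v = (1/Vmax) * [S] + Km/Vmax and return (Vmax, Km)."""
    slope, intercept = linear_fit(s, [x / y for x, y in zip(s, v)])
    vmax = 1 / slope
    return vmax, intercept * vmax

# Hypothetical, noise-free data generated with Vmax = 10 and Km = 2.
substrate = [1.0, 2.0, 5.0, 10.0, 20.0]
velocity = [10.0 * s / (2.0 + s) for s in substrate]
lb_estimate = lineweaver_burk(substrate, velocity)
hw_estimate = hanes_woolf(substrate, velocity)
```

On noise-free data the two methods agree. They diverge once measurement error is present, because each transformation weights the error at low and high substrate concentrations differently.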