The topic of the September joint Python Buffalo/Data Science meet-up was data scraping. To wrap up our conversation about using Python to scrape data from public sources, I presented a short slide deck on the ethics of web scraping. The general thesis is that the ethics of scraping very much depend on what you are doing and how you do it, and that, in the end, we should all strive to be good members of the data community. Presentation.
Clustering has become an everyday process for grouping observations that share similar characteristics, and this is particularly true when working with spatial data. For some of my ongoing research into applying spatial statistics to fluorescence microscopy, I've been applying DBSCAN to binary images of fluorescence-tagged chromosomes in order to localize them. The scikit-learn Python library provides a blisteringly fast DBSCAN implementation that can cluster 78 million observations in 6 seconds.
As I continued working with the algorithm, I began to think it would be interesting to watch the process unfold step by step on a set of data. To that end, I've created an annotated, step-by-step guide to how DBSCAN clusters data. Read more...
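For a flavor of the approach, here is a minimal sketch of clustering the foreground pixels of a binary image with scikit-learn's DBSCAN. The toy image, eps, and min_samples values are illustrative assumptions, not the ones used in the research.

```python
# Minimal sketch: cluster the "on" pixels of a binary image with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

# Toy stand-in for a thresholded fluorescence image.
image = np.zeros((100, 100), dtype=bool)
image[10:20, 10:20] = True   # one bright region
image[60:75, 55:70] = True   # another bright region

# Each foreground pixel becomes a (row, col) observation.
coords = np.column_stack(np.nonzero(image))

# Points within eps of enough neighbors (min_samples) form a cluster;
# the label -1 marks noise points.
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(coords)
print(f"found {labels.max() + 1} clusters")
```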
Starting in 2011, Kodak brought to market the Kodak Pulse line of digital photo frames. In addition to SD card and USB support, each frame in this line had an email address that could receive image attachments, store the images on Kodak's servers, and display them hassle-free on the frame. While this feature is a boon for people who like to receive photos from friends and family with minimal latency, one very important feature is missing -- the ability to download these images in bulk.
While it was possible to download each image manually using a web browser, that is no way to back up a photo album containing thousands of photographs. Rather than click through every image, I decided that a programmatic solution must exist. To that end, I created a bulk image-crawling script using the Python library Scrapy. Read more...
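As a rough illustration of the technique, here is a minimal Scrapy spider sketch. The album URL and CSS selectors are hypothetical placeholders, not the actual Kodak Pulse page structure, and the real script's details live in the full post.

```python
# Minimal sketch of a Scrapy spider that harvests image URLs from an album.
import scrapy

class AlbumSpider(scrapy.Spider):
    name = "album"
    start_urls = ["https://example.com/album"]  # placeholder album URL

    def parse(self, response):
        # Hand each image URL to Scrapy's ImagesPipeline
        # (enabled via ITEM_PIPELINES and IMAGES_STORE in settings).
        for src in response.css("img::attr(src)").getall():
            yield {"image_urls": [response.urljoin(src)]}

        # Follow pagination links, if any.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

With the images pipeline configured, running the spider (e.g. `scrapy runspider album_spider.py`) downloads every referenced image to local storage in one pass.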
A long-standing tradition in scientific research is to keep detailed notes on everything as it happens. This studious attention to detail not only makes analysis and paper writing much easier, but also serves as a record of exactly how an experiment was performed should it need to be repeated in the future. By looking through a lab notebook, an experiment can be repeated exactly and its results verified.
Although I have since moved away from the laboratory, I still keep detailed records of my work in a series of Markdown files detailing the steps taken as I perform data analyses and develop software. Over the last few months, however, I've found that the volume of my notes has grown too large to simply grep for keywords.
To make it easier to find project- or task-specific development notes, I developed a Rust-based CLI tool called Rememberall, which uses term frequency and Bayesian inference to retrieve the documents most relevant to a keyword query. Read more...
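To convey the scoring idea, here is a small Python sketch of ranking documents by a naive-Bayes-style score built from smoothed term frequencies. This is only an illustration of the concept, not Rememberall's actual Rust implementation, and the smoothing constant and sample documents are invented for the example.

```python
# Sketch: rank documents by log P(query | doc) using smoothed term frequencies.
import math
from collections import Counter

def score(query_terms, doc_tokens, vocab_size, alpha=1.0):
    """Log-likelihood of the query under the document's term distribution,
    with Laplace (add-alpha) smoothing so unseen terms don't zero it out."""
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    return sum(
        math.log((tf[t] + alpha) / (total + alpha * vocab_size))
        for t in query_terms
    )

docs = {
    "notes/dbscan.md": "dbscan clusters pixels from binary images".split(),
    "notes/scrapy.md": "scrapy spider downloads images in bulk".split(),
}
vocab = {t for tokens in docs.values() for t in tokens}
query = ["dbscan", "images"]
ranked = sorted(docs, key=lambda d: score(query, docs[d], len(vocab)), reverse=True)
print(ranked[0])  # -> notes/dbscan.md
```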
It is a common theme in non-linear modeling that small perturbations in initial conditions can result in massive deviations in the outcome of a simulation. In 1969, Nobel laureate Thomas Schelling explored how even small biases can have large sociological effects. In his paper "Models of Segregation", Schelling described a model in which even a mild preference for one's neighbors to be of a particular mixture can lead to total segregation, regardless of intent.
Using this concept as a starting point, I developed a Java-based application to simulate a Schelling segregation model for an arbitrary number of races (n >= 1). Read more...
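To show the mechanics being simulated, here is a small Python sketch of the Schelling update rule (the application described in the post is written in Java). The grid size, number of races, satisfaction threshold, and empty-cell fraction are all illustrative assumptions.

```python
# Sketch of Schelling's model: unhappy agents relocate to random empty cells.
import random

SIZE, RACES, THRESHOLD, EMPTY_FRAC = 20, 3, 0.4, 0.2

def make_grid():
    """Random grid: each cell holds a race id, or None if empty."""
    cells = [None if random.random() < EMPTY_FRAC else random.randrange(RACES)
             for _ in range(SIZE * SIZE)]
    return [cells[i * SIZE:(i + 1) * SIZE] for i in range(SIZE)]

def unhappy(grid, r, c):
    """An agent is unhappy if too few occupied neighbors share its race."""
    me = grid[r][c]
    neighbors = [grid[(r + dr) % SIZE][(c + dc) % SIZE]
                 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)]
    occupied = [n for n in neighbors if n is not None]
    return occupied and sum(n == me for n in occupied) / len(occupied) < THRESHOLD

grid = make_grid()
for _ in range(50):  # relaxation steps
    movers = [(r, c) for r in range(SIZE) for c in range(SIZE)
              if grid[r][c] is not None and unhappy(grid, r, c)]
    empties = [(r, c) for r in range(SIZE) for c in range(SIZE)
               if grid[r][c] is None]
    random.shuffle(movers)
    for (r, c) in movers:
        if not empties:
            break
        # Move the unhappy agent to a random empty cell, freeing its old one.
        er, ec = empties.pop(random.randrange(len(empties)))
        grid[er][ec], grid[r][c] = grid[r][c], None
        empties.append((r, c))
```

Even with a modest threshold like 0.4, repeated rounds of this rule tend to sort the grid into homogeneous blocks, which is precisely Schelling's point about small biases producing large-scale segregation.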