This year I had the absolute pleasure of presenting a workshop at Coalesce 2022 in the heart of New Orleans. Along with Wasila Quader, I presented advanced uses of Jinja and dbt macros for dynamic modeling, including run result storage in a data warehouse, dynamic value lookup in models, and leveraging model metadata in macros.
If you'd like to follow along at home, you can use the Get More out of Your DAG dbt project to get started!
Read more...
2020 was an odd one. There was jubilation, and there was grief. There was frustration, and there was pride. While I have always been one to reflect, 2020 is the first year that I've collected a full year of mood tracking data. In this case, the data can paint a very accurate portrait of the rollercoaster ride that was 2020.
Figure 1: Daily average mood heatmap for 2020. Mood data was collected at multiple times per day on a 1 (Awful) to 5 (Rad) scale and aggregated at a daily grain. Significant days -- highlighted using red borders -- indicate the day my daughter was born (2/6/2020) and the day I was diagnosed with a bone tumor (10/13/2020).
Read more...
This summer I learned a new card game called "To Hell and Back". Similar to "Oh Hell" and "Rats!", "To Hell and Back" is a trick-taking card game where you bid the number of tricks you intend to take, and you must take exactly that number of tricks per hand in order to win. Bid correctly, and you earn your bid and 10 extra points. Lose your bid and you get zilch. Unlike its other variations, "To Hell and Back" starts with all players being dealt one card, and each hand the number of cards dealt per hand increases until we hit our maximum for all players. In the case of one-card hands, the differences between success and failure are luck of the draw and careful bidding.
Figure 1: Total non-sequitur, but I learned how to shuffle cards while playing "To Hell and Back". I've only become decent at shuffling cards this last year after a considerable amount of practice. Now I can bridge in addition to a riffle shuffle!
I've become fascinated with this one-card case. To better understand the odds of winning this specific scenario, I've constructed a hand simulator, wherein a group of players are dealt one card, and the winner is determined. Using this simulation, we can understand just how likely each card and position is to win a hand.
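As a rough illustration of what such a simulator might look like, here is a minimal Python sketch. It is not the simulator from the full post, and the rules it encodes are assumptions made for the example: seat 0 leads, the next card off the deck is turned up to set trump, the highest trump wins, and otherwise the highest card of the led suit wins. The names `play_one_card_hand` and `simulate` are hypothetical.

```python
import random
from collections import Counter

RANKS = list(range(2, 15))  # 2..14, with the ace high (assumption)
SUITS = ["clubs", "diamonds", "hearts", "spades"]


def play_one_card_hand(n_players, rng):
    """Deal one card to each player and return the winning seat (0 leads)."""
    deck = [(rank, suit) for rank in RANKS for suit in SUITS]
    rng.shuffle(deck)
    hands = deck[:n_players]
    trump_suit = deck[n_players][1]  # assumption: next card turned up sets trump
    led_suit = hands[0][1]           # assumption: seat 0 always leads

    def strength(card):
        rank, suit = card
        if suit == trump_suit:
            return (2, rank)  # any trump beats any non-trump
        if suit == led_suit:
            return (1, rank)  # following suit beats off-suit cards
        return (0, rank)      # off-suit, non-trump cards cannot win

    return max(range(n_players), key=lambda seat: strength(hands[seat]))


def simulate(n_players=4, n_hands=10_000, seed=0):
    """Estimate each seat's share of wins over many one-card hands."""
    rng = random.Random(seed)
    wins = Counter(play_one_card_hand(n_players, rng) for _ in range(n_hands))
    return {seat: wins[seat] / n_hands for seat in range(n_players)}


if __name__ == "__main__":
    for seat, rate in simulate().items():
        print(f"seat {seat}: {rate:.3f}")
```

Tracking the winning card alongside the winning seat would extend the same loop to per-card win rates.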
Read more...
The topic of the September joint Python Buffalo/Data Science meet-up is data scraping. To wrap up our conversation about how we can use Python to scrape data from public sources, I presented a short slide deck on the ethics of web scraping. The general thesis is that it very much depends on what you are doing and how you do it, and that in the end we should all strive to be good members of the data community. Presentation.
Clustering has become an everyday process for grouping together observations based on similar factors. This is particularly true when working with spatial data. For some of my ongoing research into applying spatial statistics to fluorescence microscopy, I've been applying DBSCAN to binary images of fluorescence-tagged chromosomes to localize chromosomes. The scikit-learn Python library provides a blisteringly fast DBSCAN implementation that can cluster 78 million observations in 6 seconds.
Figure 1: Real time DBSCAN clustering of two sets of normally distributed points in a field of noise. A JS implementation of DBSCAN classified sets of two-dimensional coordinates as being either noise or one of two (or more) clusters. As a general warning, the data used for this example are randomly generated on page load, so it's possible to identify more than two clusters in this dataset due to the non-deterministic nature of both the data and DBSCAN.
As I continued working with the algorithm, I started to think that it would be interesting to see the process unfold step by step for a set of data. To that end, I've created an annotated step-by-step guide to how DBSCAN clusters data.
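For readers who want a feel for those steps before opening the guide, here is a minimal pure-Python sketch of the textbook DBSCAN procedure. It is an illustration only, not the JS implementation behind the figure nor scikit-learn's optimized version. The helper names `region_query` and `dbscan` and the parameter names `eps` and `min_samples` are chosen for this example.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1           # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core: keep growing
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (50, 50)]
    print(dbscan(pts, eps=1.5, min_samples=3))
```

This brute-force neighbor search is quadratic in the number of points, so it is only suitable for small demonstrations.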
Read more...