Nicholas A. Yager

Analyzing Geneseo's Yik Yak Community

13 Feb 2015

Due to the nature of the data analyzed this post contains strong language.

The pseudo-anonymous messaging app Yik Yak has been making waves in high schools and college communities for the last year, with officials and community members claiming that the application promotes bullying and hate speech. In March of 2014, Dr. Keith Ablow wrote an opinion piece for Fox News stating that "Yik Yak is the most dangerous app [he'd] ever seen." SUNY Geneseo, a small rural college with a campus population near 5,000 students, has a burgeoning Yik Yak community with hundreds of users. Like other college campuses, Geneseo is debating the social pros and cons of the type of anonymous forum that Yik Yak provides.


Figure 1: Peter Steiner's famous cartoon published by The New Yorker on July 5, 1993. Steiner's comic showed the general transition of the Internet out of the hands of purely government and academic use into the hands of everyday individuals. It also touches on the pseudo-anonymous aspects of Internet culture providing a disconnect between individual identity and community membership.

Despite the rhetoric and vitriol surrounding the app, and the cherry-picked examples lambasted by the media, very few people have backed up their claims with an analysis of the content actually posted on Yik Yak. Is Yik Yak really as socially dangerous as Dr. Ablow would have people believe, or is it simply a diverse and vocal community like other social networks?

The Data

With an interest in quantitatively analyzing the content posted in Geneseo's Yik Yak community, I collected all yaks posted between November 14th, 2014 and March 5th, 2015 using an unofficial Yik Yak API created by djtech42. All of the data and analyses are available on GitHub. My data set includes the yak ID, the yakker ID, the post content, the score, the latitude and longitude of the yak, and the time of the yak. The yak and yakker IDs are consecutive values with no de-anonymizing capabilities as far as I can tell. Additionally, the latitude and longitude for each yak are randomized. That leaves the content of the yaks for use in direct analysis.

For this post, I am interested in three things:

  1. What are some typical Geneseo yaks?
  2. Are yaks more positive or negative?
  3. How do Geneseo's yaks compare to tweets?

What are Geneseo's Typical Yaks?

Some useful characteristics to examine include character and word counts per yak, and common word usage. The distributions of character and word counts are both skewed right, as is common for count data. As shown in Figure 2, a typical Geneseo yak contains between 29 and 80 characters (median 49), and between 6 and 16 words (median 10). These relatively low character and word counts are comparable to the lengths produced by the character limits that microblogging services like Twitter enforce.


Figure 2: Distribution of character counts and word counts in Geneseo yaks. The distribution of character counts and word counts is skewed right, similar to a gamma or Weibull distribution. The median character count is 49 characters with an interquartile range of 51 characters per yak. The median word count per yak is 10 words, with an IQR of 10 words per yak.

Text mining requires preprocessing steps to remove unusable information. Using the tm R package, I stripped the documents of extra whitespace, punctuation, and stopwords, and converted all of the characters to lowercase. With this cleaned data, a term-document matrix can be constructed. The term-document matrix lists the number of occurrences of each word in each document. Summing each row gives the total frequency of each word across the corpus, from which a word cloud can be constructed.
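To make the cleaning and counting steps concrete, here is a rough Python sketch of the same idea. It is not the tm-based R code used for this post: the tiny stopword list, the function names, and the example documents are all assumptions made for illustration, and digits are stripped here as well, as noted in the Figure 3 caption.

```python
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "i", "to", "of", "and", "in", "my", "for"}

# Translation table that deletes punctuation and digits.
_STRIP = str.maketrans("", "", string.punctuation + string.digits)


def clean(text):
    """Lowercase, strip punctuation and digits, collapse whitespace, drop stopwords."""
    words = text.lower().translate(_STRIP).split()
    return [word for word in words if word not in STOPWORDS]


def term_document_matrix(docs):
    """Map each term to its per-document counts (rows = terms, columns = documents)."""
    counts = [Counter(clean(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    return {term: [count[term] for count in counts] for term in vocab}


def term_totals(tdm):
    """Row sums of the matrix: the corpus-wide frequency of each term."""
    return Counter({term: sum(row) for term, row in tdm.items()})


# Hypothetical documents standing in for yaks.
docs = [
    "Finals are the WORST!",
    "I love finals... said no one",
    "Coffee, coffee, coffee.",
]
tdm = term_document_matrix(docs)
totals = term_totals(tdm)
print(totals.most_common(3))
```

The row totals are the only input a word cloud needs: each word is drawn at a size proportional to its total.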

A word cloud of the most common words in the term-document matrix can be seen below. The most frequent words are drawn in a large font and a dark blue color, with the size of each remaining word scaled down according to its frequency. As shown in Figure 3, many of Geneseo's yaks include references to final exams, Geneseo, food, drinking, recreational drugs, sex, and relationships.


Figure 3: A word cloud generated from 11,589 yaks. Word occurrence frequency is depicted by both size and color. Stopwords, numbers, and punctuation were removed for the purpose of the analysis. Most yaks include references to classes, drinking, sex, breaks, and tests. Other topics of note are food, coffee, and SUNY Cortland, a regular butt of jokes in the Geneseo Yik Yak community.

Term frequency is not always the most accurate way of determining how important a particular word or phrase is in the Geneseo Yik Yak community. Instead, we can use term frequency-inverse document frequency (tf-idf) to identify the keyword of each yak. To calculate the inverse document frequency of a term, the number of documents in the corpus is divided by the number of documents in which that term appears. The term with the highest tf-idf score in a document is the word most distinctive of that document: its keyword. The most important keywords in the corpus can then be determined by counting the number of times each word is selected as a tf-idf keyword. Figure 4 includes a word cloud of keywords scaled by their frequency as keywords. In this case, common keywords discuss work, sleep, sex, and campus events such as student protests.


Figure 4: Keywords found in Geneseo yaks based on tf-idf. Keyword occurrence frequency is depicted by both size and color. The keywords in this corpus identify the subject matter of the yak more clearly. For example, the most common keywords concern sex, work, sleeping, and events that occurred on campus such as student protests.
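The keyword selection described above can be sketched as follows. This is an illustrative Python version rather than the code behind Figure 4: the function name and the toy documents are hypothetical, the input is assumed to be already cleaned and tokenized, and the logarithm applied to the document ratio is the conventional form of idf, which is an assumption here since the text above only describes the ratio itself.

```python
import math
from collections import Counter


def tfidf_keywords(tokenized_docs):
    """Return the highest-scoring tf-idf term for each tokenized document."""
    n_docs = len(tokenized_docs)

    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in tokenized_docs:
        if not tokens:
            keywords.append(None)  # nothing left after cleaning
            continue
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Iterate in sorted order so that ties resolve alphabetically.
        keywords.append(max(sorted(scores), key=scores.get))
    return keywords


# Hypothetical tokenized documents standing in for cleaned yaks.
toy_docs = [
    ["finals", "sleep", "finals"],
    ["finals", "coffee"],
    ["finals", "sleep", "cuddle"],
]
keywords = tfidf_keywords(toy_docs)
keyword_counts = Counter(keywords)  # how often each term is chosen as a keyword
print(keywords)
```

Note how "finals" is never chosen in this toy corpus even though it is the most frequent term: it appears in every document, so its idf is zero. That is the same effect that separates the two columns of Table 1.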

A direct comparison of the words returned by the term frequency approach and the tf-idf approach shows how the resulting keywords differ. Term frequency surfaces the most common words, whereas tf-idf picks out the words that are most distinctive of each yak. Generally speaking, the tf-idf keywords are mostly nouns and adverbs, whereas the term frequency results span a broader mix of parts of speech.

Table 1: Comparison of keyword results for term frequency and tf-idf.

Rank  Term Frequency Approach  TF-IDF Approach
1     finals                   fuck
2     can                      sleep
3     people                   spring
4     want                     sex
5     fuck                     cuddle
6     one                      love

Are Yaks more positive or negative?

Using a naive Bayesian classifier, I performed sentiment analysis on Geneseo's yaks. Using emoticons, a training set of 1,200 yaks was classified as being either positive, negative, or neutral. This training set was split into 4-grams and then used to weight the prior probabilities for future yaks. Using this approach, I classified the sentiments of my yak corpus.
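For readers unfamiliar with the technique, the sketch below shows the general shape of an emoticon-labeled naive Bayes classifier in Python. It is not the classifier used for this post. The emoticon sets, class and function names, and training sentences are hypothetical, and two choices are assumptions on my part: the 4-grams are taken to be character 4-grams, and add-one smoothing is used for unseen features.

```python
import math
from collections import Counter, defaultdict

# Illustrative emoticon sets (assumptions, not the full lists used in the post).
POSITIVE = {":-)", ":)", ":'D"}
NEGATIVE = {":-(", ":(", ":'("}


def emoticon_label(text):
    """Label a post by the emoticons it contains; mixed or none means neutral."""
    tokens = set(text.split())
    has_pos, has_neg = bool(tokens & POSITIVE), bool(tokens & NEGATIVE)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return "neutral"


def char_ngrams(text, n=4):
    """Split lowercased text into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over character 4-grams with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, text, label):
        features = char_ngrams(text)
        self.class_counts[label] += 1
        self.feature_counts[label].update(features)
        self.vocab.update(features)

    def classify(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.class_counts[label] / total_docs)
            # The extra +1 reserves probability mass for unseen n-grams.
            denom = sum(self.feature_counts[label].values()) + len(self.vocab) + 1
            for feature in char_ngrams(text):
                score += math.log((self.feature_counts[label][feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training posts, labeled automatically by their emoticons.
model = NaiveBayes()
for post in ["i hate finals :-(", "i love cuddles :-)"]:
    model.train(post, emoticon_label(post))

print(model.classify("hate finals"))
print(model.classify("love cuddles"))
```

Labeling by emoticon avoids hand-annotating every training post, at the cost of assuming that the emoticon reflects the sentiment of the surrounding text.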

As shown in Figure 5, the majority of Geneseo yaks are neutral (58.1%), with negative yaks being the second most common (30.2%). Generally, the strongest identifiers of negativity were sad emoticons such as :-( and :'(, as well as the words "finals", "hate", and "violated". The best predictors of positive sentiment were the emoticons :-) and :'D, as well as the words "love", "cute", "mvp", and "sex".


Figure 5: Yak sentiment analysis. A naive Bayesian classifier was trained on 878 emojis and emoticons which were manually classified as positive, neutral, or negative. A corpus of 1,200 yaks was successfully classified. Of the classified yaks, 30.2% were classified as negative, 58.1% as neutral, and 11.7% as positive.

It should be noted that the Bayesian classifier is a simplistic method of identifying sentiment, and the relatively small corpus size makes sweeping inference dubious. That said, there seem to be considerably more negative posts than positive posts.

I believe that the disparity in sentiment can be explained by D. M. Pedersen's work outlined in Psychological Functions of Privacy, in which he posited that anonymity allows users to experience recovery, catharsis, and autonomy. Recovery is a sense of rejuvenation in the refuge of anonymity; catharsis is a form of emotional purging that can be either negative or positive; and autonomy is the freedom to experiment with new social behaviors without fear of social consequences.

Geneseo's Yik Yak community exhibits at least the latter two of these functions. The appearance of "finals" and "work" in the keyword list suggests that many yaks are used as a form of catharsis, in which Geneseo students vent their frustrations about their professors and finals. Yaks also harbor open discussions about sexuality and gender, which are typically eschewed in western society, providing an opportunity for yakkers to test both the social waters and their sexuality. Furthermore, it has been suggested that catharsis, recovery, and autonomy can be beneficial to an individual's well-being. With that in mind, Yik Yak may be providing a much-needed outlet for Geneseo students to openly vent their frustrations without compromising their identity.

How do Geneseo's Yaks compare to Tweets?

Data in and of itself is nice, but it can be more telling if there is another data set to compare it to. In this case, I compared the sentiments of yaks to those of posts on a similar microblogging service: Twitter. Twitter, like Yik Yak, is a communication platform that limits messages to 140 characters. Twitter, however, is anything but anonymous and does not geofence users into defined communities.

To perform my basic sentiment analysis on tweets, I analyzed the corpus of tweets collected by Sentiment140, a sentiment analysis project by Alec Go, Richa Bhayani, and Lei Huang from Stanford University. A basic analysis of character and word counts, as shown in Figure 6, suggests character and word count distributions similar to those of yaks. Tweets run somewhat longer than yaks (median 69 characters versus 49), but their length is capped by the strict character limit on Twitter.


Figure 6: Distribution of character counts and word counts in tweets. The distribution of character counts and word counts is skewed right in a manner that is more platykurtic than the Yak distribution. The median character count is 69 characters with an interquartile range of 61 characters per tweet. The median word count per tweet is 13 words, with an IQR of 11 words per tweet.

Using the techniques discussed above, I also generated representative term frequency and tf-idf word clouds for the tweets processed. Figure 7 shows many references to time, such as "morning", "day", "today", and "now". There are also sentiment words such as "good", "bad", "great", and "sorry". Unfortunately, the term frequency words shown in Figure 7 do not reveal much about the content of the tweets being analyzed.


Figure 7: Term frequency keywords for the Twitter corpus. Keyword occurrence frequency is depicted by both size and color. From a corpus of 5,000 tweets, we see many references to times, such as "now", "day" and "morning", as well as things like "good", "work", and "home".

The tf-idf keywords are less clear-cut for the Twitter corpus than they are for the Yik Yak corpus. The keywords generated by tf-idf include references to people, such as "friends" and "mothers". There are also references to "work" and "school", much like the Yak corpus. An interesting inclusion is "quot", which appears to be residue of encoded quotation marks (the HTML entity &quot;) left in the tweet text, in addition to its use as a hashtag.


Figure 8: Keywords found in tweets based on tf-idf. Keyword occurrence frequency is depicted by both size and color. The keywords in this corpus identify the subject matter of tweets more clearly than the term frequency approach, but not perfectly. For this corpus the most common keywords concern people, work, school, and the events of people's lives.

Using the same emoji and emoticon classification system, I classified the sentiment of 5,000 tweets and tabulated the number of negative, neutral, and positive sentiments next to the results gathered for yaks. As shown in Figure 9, there are generally more posts with a classifiable sentiment in the Twitter testing data set than there are in the Yik Yak data set. This could be either a result of the sampling methods used or a byproduct of the extensive use of rhetorical and sarcastic content on Yik Yak.


Figure 9: More tweets have positive sentiment than Yaks. A naive Bayesian classifier was tested against a sample set of 5,000 tweets from 2009. Of the classified tweets, 35% were negative, 28% were neutral, and 37% were positive. This is a much greater proportion of positive sentiments in the Twitter corpus than the Yak corpus.
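The gap between the two corpora is easier to see with a little arithmetic on the percentages reported in Figures 5 and 9. The Python sketch below only recombines those published figures; the helper names are hypothetical.

```python
# Percentages reported in Figure 5 (yaks) and Figure 9 (tweets).
yaks = {"negative": 30.2, "neutral": 58.1, "positive": 11.7}
tweets = {"negative": 35.0, "neutral": 28.0, "positive": 37.0}


def classifiable_share(dist):
    """Percentage of posts labeled either positive or negative."""
    return round(dist["positive"] + dist["negative"], 1)


def positive_to_negative(dist):
    """Ratio of positive to negative posts."""
    return round(dist["positive"] / dist["negative"], 2)


for name, dist in (("yaks", yaks), ("tweets", tweets)):
    print(name, classifiable_share(dist), positive_to_negative(dist))
```

By this measure roughly 42% of yaks carry a classifiable sentiment compared with 72% of tweets, and positive tweets slightly outnumber negative ones while positive yaks are outnumbered by more than two to one.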

Concluding Thoughts

I think that Yik Yak can be an interesting new environment in which to study social interactions without the social pressures seen in interpersonal communications both in real life and on other social networks. As Geneseo's Yik Yak community grows and more data is collected, researchers may develop a better understanding of how people behave when they are able to broadcast anything to the general public. Using machine learning, we may be able to study the occurrence of hate speech and cyberbullying, examine the speed at which localized memes spread, and perhaps observe the formation of a new social structure at SUNY Geneseo.

On the surface, it looks as though Yik Yak is a community of venting college students filled with academic stress and hate, but in the murky deep below is a realm of social experimentation and the need to be heard among the masses. Having personally read hundreds of yaks, I can say firsthand that media outlets are exaggerating the prevalence of hate on Yik Yak. Like any community, Geneseo's Yik Yak self-regulates its content, downvoting unsavory opinions and incendiary posts into oblivion. Yik Yak is not dangerous; Yik Yak is a litmus test for real social issues college students deal with every day.
