
Twitter Sentiment Analysis

Data Science I / BST 260


Group Members



Needless to say, 2017 has been a turbulent year: nationalism, hate crimes, and xenophobic attitudes are on the rise and have become even more brazen. The infamous “Unite the Right” rally in Charlottesville, VA sent shockwaves across the nation, catalysed several difficult conversations, and escalated violence and the destruction of property. The term “Twitter Revolution” refers to the use of social networking sites by protestors and demonstrators to communicate during civil unrest. Parsing data from Twitter (bytes of bigger conversations) can capture fleeting emotions and solidify networks within a subgroup. Social media in rallies has been cited as a potential model for the interactions that occur through conventional means. However, the empirical evidence on the efficacy of the Twitter Revolution phenomenon is conflicting.

In this project we attempted to codify and quantify the “Twitter Revolution” in Tennessee by using sentiment and network analysis. We analyzed tweets in a 50-mile catchment area surrounding Murfreesboro and Shelbyville during the Shelbyville White Lives Matter rallies and Murfreesboro Loves counter-protest from October 27 to October 29, 2017.


Data Methodology


We first set out to see whether people in Charlottesville who were actively tweeting during the event were collectively organizing and either influencing or reacting to the event through their content. However, due to limitations of Twitter’s API, we had to use another protest as the basis of our analysis.


Using the twitteR package developed by Jeff Gentry, we accessed the Twitter Streaming API and obtained all tweets between 00:00:01 October 27, 2017 and 23:59:59 October 29, 2017. The data represents 65,955 different tweets from 22,209 unique Twitter accounts. To further simplify our analysis, we rounded tweet timestamps into 15-minute increments.
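The time binning described above can be sketched as follows. The project itself was carried out in R; this is an illustrative Python sketch, not the code used in the analysis. The function name `floor_to_interval` and the sample timestamps are hypothetical, and the sketch assumes that "rounding" means rounding down to the start of each 15-minute bin.

```python
from datetime import datetime, timedelta


def floor_to_interval(ts: datetime, minutes: int = 15) -> datetime:
    """Round a timestamp down to the start of its bin.

    Assumes `minutes` divides 60 evenly (e.g. 5, 10, 15, 30).
    """
    # Everything past the most recent bin boundary is subtracted.
    excess = timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )
    return ts - excess


# Hypothetical tweet timestamps inside the study window.
timestamps = [
    datetime(2017, 10, 28, 13, 7, 42),
    datetime(2017, 10, 28, 13, 14, 59),
    datetime(2017, 10, 28, 13, 15, 0),
]
bins = [floor_to_interval(t) for t in timestamps]
```

The first two timestamps land in the bin starting at 13:00 and the third in the bin starting at 13:15, so tweets can then be counted or scored per bin.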


Stopwords, UTF-8 emojis, punctuation, replies (@), retweets, linefeeds, and URLs were removed from tweets using regular expression functions.
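A cleaning step of this kind might look like the sketch below. It is written in Python for illustration and is not the project's R code. The patterns, the tiny stopword list, and the function name `clean_tweet` are all assumptions. In particular, the sketch strips the leading "RT" marker rather than dropping retweets entirely, and it removes emojis crudely by discarding every non-ASCII character.

```python
import re
import string

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "at", "on", "for"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")             # replies and mentions
RETWEET_RE = re.compile(r"^RT\b:?\s*")       # leading retweet marker
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")  # crude emoji removal: drops all non-ASCII
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def clean_tweet(text: str) -> str:
    """Return a lowercased, space-separated string of the remaining words."""
    text = RETWEET_RE.sub("", text)
    text = URL_RE.sub(" ", text)      # URLs go first, before punctuation is stripped
    text = MENTION_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    # split() with no arguments also collapses linefeeds and repeated spaces.
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words)


raw = "RT @user: The rally is at noon!\nhttps://t.co/abc \U0001F600 #Shelbyville"
cleaned = clean_tweet(raw)
```

The order of the substitutions matters: URLs and mentions are removed before punctuation, because stripping punctuation first would break them into fragments that no longer match their patterns.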



We used sentiment data sets available through the tidytext R package for the sentiment analysis.

From our dataset of tweets, we used the afinn and nrc datasets (separately) to assign one or more sentiments to each tweet, and then explored how the sentiments changed both quantitatively and qualitatively over time. In addition, building on the network analysis, we subsetted the tweets dataset by network neighborhood to explore the general sentiment for different neighborhoods over time.
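The two lexicons support different kinds of scoring: afinn assigns each word a signed integer, while nrc assigns words to sentiment categories. A rough Python sketch of both, plus an average per time bin, is given below. It is not the project's R code. The lexicons `TOY_AFINN` and `TOY_NRC` are toy stand-ins with made-up entries, not values taken from the real data sets, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy stand-ins for the lexicons; words and values are illustrative only
# and are NOT copied from the real afinn or nrc data.
TOY_AFINN = {"love": 3, "peace": 2, "hate": -3, "violence": -3}
TOY_NRC = {
    "love": {"joy", "positive"},
    "peace": {"positive", "trust"},
    "hate": {"anger", "negative"},
    "violence": {"fear", "negative"},
}


def afinn_score(tokens):
    """Sum the signed scores of the tokens; unknown words count as 0."""
    return sum(TOY_AFINN.get(t, 0) for t in tokens)


def nrc_counts(tokens):
    """Count how many tokens fall into each sentiment category."""
    counts = Counter()
    for t in tokens:
        counts.update(TOY_NRC.get(t, ()))
    return counts


def mean_score_by_bin(tweets):
    """tweets: iterable of (time_bin, tokens). Returns {time_bin: mean score}."""
    scores = defaultdict(list)
    for time_bin, tokens in tweets:
        scores[time_bin].append(afinn_score(tokens))
    return {b: sum(v) / len(v) for b, v in scores.items()}


# Hypothetical cleaned tweets, already assigned to 15-minute bins.
tweets = [
    ("13:00", ["love"]),
    ("13:00", ["hate", "violence"]),
    ("13:15", ["peace"]),
]
by_bin = mean_score_by_bin(tweets)
```

Restricting `tweets` to the accounts in one network neighborhood before calling `mean_score_by_bin` would give the per-neighborhood view described above.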


A large disconnected network formed between October 27th and October 29th. Focusing on the largest component of the graph, we discovered that a significant portion of those interactions occurred on Saturday, October 28th. That day also saw the highest rate of new interactions, where a new interaction is a new Twitter account interacting with an account already in the largest component. This suggests that Twitter users were responding to the protests in real time. Using the same network, we identified the most active and influential accounts, using centrality and hub score as our determining factors. These accounts were incorporated into the sentiment analysis as a comparative measure.
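To illustrate the network step, the sketch below finds the largest connected component of an undirected interaction graph and ranks its accounts by degree centrality. It is a Python sketch under stated assumptions, not the project's code. The source does not say which centrality measure was used, so degree centrality is an assumption here, and hub scores are not computed. The edge list and function names are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (account, account) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def largest_component(graph):
    """Return the node set of the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def degree_centrality(graph, nodes):
    """Fraction of the other nodes in `nodes` that each node is linked to."""
    n = len(nodes)
    if n < 2:
        return {v: 0.0 for v in nodes}
    return {v: len(graph[v] & nodes) / (n - 1) for v in nodes}


# Hypothetical interactions (mentions or replies) between accounts.
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("x", "y")]
graph = build_graph(edges)
lcc = largest_component(graph)
centrality = degree_centrality(graph, lcc)
most_central = max(centrality, key=centrality.get)
```

In this toy graph the accounts `x` and `y` form a separate small component, which mirrors the disconnected structure described above on a much smaller scale.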

Data Visualization


Sentiment Analysis

Network Overall: This is the overall network representing every Twitter account that was active and all the tweets that were posted from 00:00:01, October 27th to 23:59:59, October 29th. The largest connected component can be found at the bottom of the network.

LCC_hubs: The ten largest hubs are shown on the network. An interactive plot with hub scores and account centrality is shown below.