Twitter Analysis: Trump vs. Biden

A Deep Dive Into the Twitter Followers of Two of America’s Most Divisive Political Figures Using Machine Learning Modeling, Entropic Analysis, and Data Visualization

By Teagan Johnson, Osip Surdutovich, and Nathan Hedgecock

Introduction:

Throughout the 2020 election season, Twitter was firmly entrenched as a venue for Americans to process campaign news and engage in political discourse. Journalists, politicians, and the public have used Twitter as an indicator of political trends by analyzing tweets, likes, hashtags, and other types of activity. At the center of this political explosion on Twitter are the two presidential candidates themselves: Donald Trump and Joe Biden. In this project, we take a deep dive into Trump and Biden’s Twitter followers. By utilizing data visualization techniques, entropic analysis, and machine learning modeling, we investigate the polarization between Trump and Biden followers, predict who a Twitter user is most likely to follow based on the accounts they follow, and analyze other significant trends.

Data Collection:

In order to create a prediction tool, we needed to create a matrix and a target array. A matrix is a set of vectors laid out in rows and columns. A target array is a corresponding vector that records which of two classes each row of the matrix belongs to. They look something like this:

Matrix
Target Array

In our case, the 0’s and 1’s in the matrix represent whether or not a user (represented by each row) follows one of the most commonly followed accounts by Trump and Biden followers. The 0’s and 1’s in our target array represent whether a user follows Trump (1) or Biden (0). Each row of the target array corresponds to a row of the matrix. After we created the matrix of the users that follow either Trump or Biden, we created a matrix of users that follow both Trump AND Biden. The matrices and target array keep our data organized and will also be used to make predictions and different types of analysis.

To create the matrices and target array, we needed to gather data from Twitter. We needed to pull followers of Trump and Biden, then later pull who each of their followers were following. Each row of the matrix would represent a specific user that follows either Trump or Biden, and each column would represent one of the most commonly followed users by Trump and Biden followers. To create the matrix and target array, we needed A LOT of data.

Our original plan was to gather data with the Twitter API. However, after being granted a free Twitter developer account, we soon discovered that the API heavily restricted the number of profiles we could pull. Luckily, we came across a Python library called Twint that allowed us to pull data from Twitter without those limits.

With the use of Twint, we pulled thousands of Trump and Biden followers and saved them to our computers. To actually create the matrix and target array, we followed three steps:

  1. Split the followers into 3 groups: Trump followers, Biden followers, and people who follow both. Save each group to our computer as a folder; within each folder, every user’s following list is saved as a CSV file.
  2. Find the top 60 most commonly followed users between both Biden and Trump followers. Add both Trump and Biden’s top 60 lists together and remove duplicates to create a master list of the most followed accounts. Use this list to create the columns of our matrix. Our list ended up being 102 elements long.
  3. Iterate through each member of the groups and create a vector that is 102 elements long. Each column represents one of the most commonly followed accounts. If the user follows one of the top accounts, we insert a 1 in the column. If not, we insert a zero. This results in a “follower vector” for each follower of Trump and Biden. After creating each follower vector, we combine all of them to create the matrix and target array.
This is an example of a follower vector.
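
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project’s actual code: the function names `top_accounts` and `build_matrix`, the toy handles, and the choice to keep following lists in memory (rather than reading them from CSV files) are all hypothetical.

```python
from collections import Counter

def top_accounts(following_lists, k):
    """Return the k accounts that appear in the most following lists."""
    counts = Counter(acct for follows in following_lists for acct in set(follows))
    return [acct for acct, _ in counts.most_common(k)]

def build_matrix(trump_lists, biden_lists, k=60):
    """Build the column list, the 0/1 matrix, and the target array."""
    # Master column list: both top-k lists joined, duplicates removed,
    # original order preserved.
    combined = top_accounts(trump_lists, k) + top_accounts(biden_lists, k)
    columns = list(dict.fromkeys(combined))

    matrix, target = [], []
    for label, group in ((1, trump_lists), (0, biden_lists)):
        for follows in group:
            followed = set(follows)
            # One "follower vector" per user: 1 if the user follows the column account.
            matrix.append([1 if acct in followed else 0 for acct in columns])
            target.append(label)  # 1 = Trump follower, 0 = Biden follower
    return columns, matrix, target

# Toy data with made-up following lists (not the project's data).
trump_lists = [["FoxNews", "NASA"], ["FoxNews", "DonaldJTrumpJr"]]
biden_lists = [["BarackObama", "NASA"], ["BarackObama", "KamalaHarris"]]
columns, matrix, target = build_matrix(trump_lists, biden_lists, k=2)
```

Each row of `matrix` lines up with the same position in `target`, which is the property the later modeling steps rely on.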

After pulling the Trump and Biden followers from Twitter and creating the matrices and target array, we’re ready to use data science tools to analyze the data.

Data Visualization:

In total, we ended up analyzing 1381 Biden followers, 1324 Trump followers, and 782 users that follow both. Our matrix had 102 columns to represent each of the top 102 most commonly followed accounts by Trump and Biden followers, resulting in a 2705 x 102 matrix.

In order to more fully understand the shape and size of our data, we created a series of graphs and histograms. Along with providing a solid foundation of our analysis, these graphs begin to show trends in our data.

**Note: For clarity, we will be referring to the top 102 most commonly followed accounts by Trump and Biden supporters as “the column accounts”. This is in reference to the fact that we use these accounts as the columns of our matrix.

Below is a histogram depicting the number of followers for each of the column accounts. Under the larger histogram, there are two zoomed-in graphs of both ends of the spectrum: the top ten most and least popular column accounts.

The above distribution, depicting the number of followers per column account, is heavily skewed right.
A zoomed-in look at the top ten most and least popular column accounts.

It appears that the first three column accounts (Obama, Kamala Harris, and POTUS) have many more Trump and Biden followers than the rest of the group. The lowest ten accounts of the column matrix appear to have around 50 Trump and Biden followers each. The distribution of Trump and Biden followers amongst the column accounts is clearly skewed right. This heavy skew implies that the connections within our data are similar to a Barabási–Albert network. Barabási–Albert networks are networks in which “we should expect a few nodes to be very highly connected, and the vast majority to have a smaller degree than the average” (Barabási). Most social media platforms, like Twitter, are described as having Barabási–Albert networks due to the relatively small number of people with lots of followers.

Next, we analyze the total number of accounts that are followed by the three different groups (Trump followers, Biden followers, and followers of both).

This bar graph shows the difference between the average number of accounts followed per user. Users following both Biden and Trump follow the most accounts on average.

On average, users that follow both Trump and Biden follow the most column accounts, while users that follow Trump follow the fewest.

Below is a histogram comparing the number of Trump followers and Biden followers for each column account. The column accounts on the x-axis are organized by the number of Trump followers per column account in decreasing order.

This distribution shows the differences in the number of users following column accounts between Trump and Biden followers.

As mentioned above, however, Biden followers follow more people on average. Because of this discrepancy, the above graph is clearly going to show that Biden followers are following more people. To get a more “fair” look at the distribution of followers, we normalized the above histogram.

This distribution is the same as the previous one, but it’s normalized to correct for the fact that users following Trump don’t follow as many accounts.
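
The article does not spell out how the normalization was done. One plausible reading, sketched below purely as an assumption, is to convert each group’s per-account counts into shares of that group’s total follows, so that a group that follows fewer accounts overall is not penalized. The function name `normalize_counts` and the counts are hypothetical.

```python
def normalize_counts(counts):
    """Scale per-account follow counts so they sum to 1 for the group."""
    total = sum(counts.values())
    if total == 0:
        return {acct: 0.0 for acct in counts}
    return {acct: n / total for acct, n in counts.items()}

# Made-up counts: how many users in each group follow each column account.
trump_counts = {"NASA": 30, "FoxNews": 60, "BarackObama": 10}
biden_counts = {"NASA": 60, "FoxNews": 20, "BarackObama": 120}

trump_share = normalize_counts(trump_counts)
biden_share = normalize_counts(biden_counts)
```

With shares instead of raw counts, the two groups can be compared account by account even though one group follows twice as many accounts in total.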

Looking at the normalized distribution, there appears to be a roughly inverse relationship between the number of Trump followers and the number of Biden followers per column account. Near both edges, the difference between Biden and Trump followers for each account is massive. There are some “Biden spikes” where column accounts that are heavily followed by Trump followers are also heavily followed by Biden followers. Most spikes correspond to accounts that are seemingly non-controversial: NASA, BBC, NY Times, Cristiano Ronaldo, Justin Bieber. The leftmost and largest spike, however, represents Obama’s account. Obama probably appears in the top ten most followed accounts by Trump followers because he was president and held one of the most public positions in America. Even so, many more Biden followers than Trump followers follow Obama.

Below on the left is a zoomed-in look at the histogram above. It depicts the top ten most followed column accounts by Trump followers in descending order. The histogram on the right is just the opposite. It depicts the top ten most followed column accounts by Biden followers in descending order.

The difference between the number of Trump followers and Biden followers per column account tends to be large. This is evidence supporting the claim that there is a high degree of polarization between Trump and Biden Twitter followers.

Shannon Entropy Analysis:

What is the Shannon Entropy (Kumar)?

Shannon entropy is a number that represents the level of chaos or surprise in a statistical observation. For an event with two possible outcomes, measured with a base-2 logarithm, it ranges from 0 (no surprise) to 1 (maximum surprise). A simple example is tossing a fair coin. Since the coin has an even chance of landing on either side, the entropy of each toss is 1, the maximum. In practical terms, if you always expect heads, there is a one in two chance that you are surprised by the result; in other words, you predict incorrectly half of the time.

The general formula for Shannon entropy is:

$$H(x) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)$$

H(x) is the entropy, a single number: the negative sum, over all n possible outcomes, of each outcome’s probability multiplied by the logarithm of that probability.
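
As a small worked example of the formula, the sketch below computes the entropy of a few two-outcome distributions using a base-2 logarithm. The function name `shannon_entropy` is our own choice for illustration; terms with zero probability are skipped because they contribute nothing to the sum.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits: the sum of -p * log2(p) over outcomes with p > 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])    # maximum uncertainty for two outcomes
biased_coin = shannon_entropy([0.9, 0.1])  # mostly predictable
certain = shannon_entropy([1.0, 0.0])      # no uncertainty at all
```

A fair coin gives an entropy of 1, a heavily biased coin gives roughly 0.47, and a guaranteed outcome gives 0. Lower values mean less surprise.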

For our uses, we will rely on the weighted average entropy (WAE) of each of the column accounts, which means the entropy of each account when controlling for the total number of followers per account. The WAE of an account represents the homogeneity of the split it produces. For example, an account with a lower WAE will tend to be more polarized towards one candidate. By finding the WAE of each account, we will be able to figure out which accounts are the best indicators of follower orientation (Trump v Biden). Our team decided to classify a WAE below 0.90 as a meaningful value. This is a relatively standard cut-off point when analyzing WAEs. Below is the distribution of our WAE values for the 102 column accounts.
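
The exact WAE computation is not shown in this article, so the following is a sketch under an assumed definition borrowed from decision-tree splitting: split the users by whether they follow a column account, compute the Trump/Biden entropy inside each half, and average the two entropies weighted by the size of each half. The names `label_entropy` and `weighted_average_entropy` are hypothetical.

```python
import math

def label_entropy(labels):
    """Base-2 entropy of a list of 0/1 labels (0 = Biden, 1 = Trump)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def weighted_average_entropy(column, target):
    """Entropy of the target after splitting users on one column account."""
    follows = [t for c, t in zip(column, target) if c == 1]
    others = [t for c, t in zip(column, target) if c == 0]
    n = len(target)
    return (len(follows) / n) * label_entropy(follows) + \
           (len(others) / n) * label_entropy(others)

target = [0, 0, 1, 1]
perfect_indicator = weighted_average_entropy([1, 1, 0, 0], target)  # separates the groups
useless_indicator = weighted_average_entropy([1, 0, 1, 0], target)  # tells us nothing
```

Under this definition, a column that cleanly separates Trump followers from Biden followers has a WAE of 0, and a column followed equally by both groups has a WAE of 1.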

The distribution of WAEs is skewed left. Lower WAEs indicate accounts that are good indicators of who a user follows. The account furthest to the left is Kamala Harris.

The accounts are heavily skewed towards a WAE of 1, meaning that the majority of the column accounts have a lot of chaos (low homogeneity) among their followers. These accounts are not good indicators of a user’s follower orientation. However, there are a few accounts that have low WAEs, the lowest one being below 0.7 (KamalaHarris). This means that users who follow Kamala Harris overwhelmingly follow one candidate rather than being evenly split between Trump and Biden. Since Kamala Harris is Biden’s vice president, it seems reasonable that her account is highly correlated with Biden’s.

After analyzing the WAE distribution of the column accounts, we found the accounts with the top ten highest and lowest WAEs.

On the left, we see the most effective accounts that are best at predicting who a user is most likely to follow. The right shows accounts that are not effective in determining who a user is most likely to follow.

We observe that among accounts with the lowest WAEs, there are several famous Democratic politicians. Among them: future vice president Kamala Harris, Barack Obama, and Michelle Obama. There were also prominent conservative figures like Sidney Powell, Lin Wood, and Trump’s son Donald Trump Jr. These distributions indicate that well-known political figures that are either right- or left-leaning are good indicators of whether somebody follows Trump or Biden.

The highest WAEs are made up of news channels, sports channels, comedians, National Geographic, and Twitter’s official account. The distribution of the highest WAEs shows us the accounts that are the least effective at indicating who a user follows (Trump or Biden). The highest entropy tables can once again be found in the table section.

Below is a split entropy tree of all the column accounts. A branch is formed after splitting the group of followers by the account with the lowest WAE.

This split entropy tree illustrates how the group of Twitter users is split up by column accounts.
This is a zoomed-in look at the first 2 splits of the group.

This tree shows the process of separation and demonstrates which accounts have the lowest WAE. The initial split occurs at the top vertex of the image. The Kamala Harris account, with the lowest WAE, is used as a “split”. Users that follow her are separated into one group and users that don’t are separated into another. This process continues for each new group until a group has fewer than 5 users. One of the values in each group is the entropy value. The groups with lower entropies contain users that are more homogeneous (i.e. most users in the group follow the same candidate).
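
The splitting procedure described above can be sketched as a short recursive function. This is an illustration only: it assumes the WAE of a split is the size-weighted average of the entropies of the two resulting groups, and the names `build_tree`, `split_wae`, and the dictionary layout of each node are hypothetical rather than taken from the project’s code.

```python
import math

def label_entropy(labels):
    """Base-2 entropy of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def split_wae(rows, labels, col):
    """Weighted average entropy after splitting the users on column `col`."""
    yes = [t for r, t in zip(rows, labels) if r[col] == 1]
    no = [t for r, t in zip(rows, labels) if r[col] == 0]
    n = len(labels)
    return len(yes) / n * label_entropy(yes) + len(no) / n * label_entropy(no)

def build_tree(rows, labels, columns, min_users=5):
    """Recursively split on the column with the lowest WAE."""
    node = {"n": len(labels), "entropy": label_entropy(labels)}
    # Only columns that actually divide this group are worth considering.
    usable = [c for c in columns if 0 < sum(r[c] for r in rows) < len(rows)]
    # Stop when the group is small, already homogeneous, or cannot be split.
    if len(labels) < min_users or node["entropy"] == 0 or not usable:
        return node
    best = min(usable, key=lambda c: split_wae(rows, labels, c))
    rest = [c for c in columns if c != best]
    yes = [i for i, r in enumerate(rows) if r[best] == 1]
    no = [i for i, r in enumerate(rows) if r[best] == 0]
    node["split"] = best
    node["follows"] = build_tree([rows[i] for i in yes], [labels[i] for i in yes], rest, min_users)
    node["others"] = build_tree([rows[i] for i in no], [labels[i] for i in no], rest, min_users)
    return node

# Toy data: column 0 separates the two groups perfectly, column 1 does not.
rows = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
labels = [0, 0, 0, 1, 1, 1]
tree = build_tree(rows, labels, columns=[0, 1])
```

In the toy data the root splits on column 0, and both child groups end up with an entropy of 0 because each contains followers of only one candidate.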

Modeling:

One of our main goals was to use the column accounts to predict whether a user follows Trump or Biden. To do this, we fit our matrix and target array to a Bernoulli model from scikit-learn. We chose a Bernoulli model because it’s a relatively simple model that works when there are only two outcomes. In our case, the possible outcomes were “Trump follower” or “Biden follower”; every user in the matrix is one or the other. Using our matrix and target array, we trained our model on roughly 90% of our group of 2705 unique followers. The remaining 10% of the unique followers were used for testing the accuracy of the model.
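
To make the idea concrete, here is a from-scratch sketch that assumes the “Bernoulli model” is a Bernoulli naive Bayes classifier (scikit-learn provides one as `BernoulliNB`). The code below is a standard-library stand-in written for illustration, not the project’s model: the class `BernoulliNaiveBayes`, the helper `split_train_test`, the smoothing value, and the toy data are all assumptions.

```python
import math
import random

class BernoulliNaiveBayes:
    """Minimal Bernoulli naive Bayes for 0/1 features."""

    def fit(self, X, y, alpha=1.0):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        self.log_prior, self.feature_prob = {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            # Laplace-smoothed estimate of P(feature = 1 | class).
            self.feature_prob[c] = [
                (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                for j in range(n_features)
            ]
        return self

    def _log_scores(self, x):
        scores = {}
        for c in self.classes:
            s = self.log_prior[c]
            for xj, p in zip(x, self.feature_prob[c]):
                s += math.log(p if xj == 1 else 1 - p)
            scores[c] = s
        return scores

    def predict(self, X):
        return [max(sc, key=sc.get) for sc in map(self._log_scores, X)]

def split_train_test(X, y, test_fraction=0.1, seed=0):
    """Shuffle the users and hold out a fraction of them for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * test_fraction)
    test, train = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data: Trump followers (1) follow column 0, Biden followers (0) follow column 1.
X = [[1, 0]] * 10 + [[0, 1]] * 10
y = [1] * 10 + [0] * 10
X_train, y_train, X_test, y_test = split_train_test(X, y, test_fraction=0.1)
model = BernoulliNaiveBayes().fit(X_train, y_train)
test_accuracy = accuracy(y_test, model.predict(X_test))
```

The toy data is perfectly separable, so the held-out accuracy here is trivially perfect. Real follower data is far noisier.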

The bottom accuracy in the figure is the accuracy of our model on the held-out testing data. The top accuracy in the figure is the accuracy of our model on the data that it was trained on.

Above are the accuracy, precision, recall, and F1 scores from evaluating our model on both the training and the testing data. For our model, roughly half the data points follow Trump and half the data points follow Biden, so there is little to no class imbalance. Because of this, accuracy is the most effective way to analyze our results (rather than precision or recall). One example in which we may want to consider precision or recall would be if our model was being used for phone banking. For example, if the goal was to sway all possible Trump supporters to vote for Biden, the recall would be most important so that no Trump voters slipped through the cracks.
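
For readers unfamiliar with these four scores, the sketch below computes them from a list of true labels and predicted labels, treating “Trump follower” (1) as the positive class. The function name `classification_metrics` and the example labels are made up for illustration and are not the project’s results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 1 = Trump follower, 0 = Biden follower.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In the phone banking scenario, a low recall would mean that some actual Trump followers were labeled as Biden followers and never contacted.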

Our model has a 72% accuracy (the lower accuracy figure). Just how accurate is this? Since our data is binary (users follow either Trump or Biden) and the two groups are nearly equal in size, the baseline accuracy is roughly 50%. Our model is 22 percentage points above that baseline. If you were to look up whether this is a good accuracy on Google, you wouldn’t find a clear answer. In the world of machine learning, there are some situations where an accuracy that’s 2 percentage points better than the baseline is wildly successful. There are also situations where an accuracy that’s 40 percentage points better than the baseline is disappointing. In our case, we scraped just under 3,000 users and trained our model using 102 column accounts. Because our data set is relatively small, our results were more likely to have high variance. Even so, our model was 22 percentage points more accurate than the baseline. This suggests two things: first, our model produces useful results even with a small sample size. Second, with more data points and more users, our model might reach a considerably higher accuracy.

After creating the model, we applied it to our matrix containing the users who follow both Trump and Biden to see if there were any noticeable trends. We plotted the distribution of the predicted probabilities that each of these users would follow Trump.

This distribution shows that users following both Biden and Trump tend to follow more accounts correlated with Biden.

First of all, it’s clear that a majority of users following both Trump and Biden follow more accounts correlated with Biden. But what does this mean?

It can be concluded that the users towards each edge of the distribution represent people who are much more likely to follow Trump or Biden. The number of users more likely to follow Biden greatly outnumbers the number of users more likely to follow Trump. To put this another way, Biden followers are much more likely to follow accounts correlated with Trump than Trump followers are to follow accounts correlated with Biden.

Basically, if you take two Twitter users (one that follows Trump and one that follows Biden), it is more likely that the user following Biden would be following Trump-correlated accounts than the other way around. This indicates that Trump followers are less likely to follow accounts with different political orientations than their own.

Accuracy Improvements:

Although our model is already fairly accurate, it’s possible for it to be even more accurate. If we only include users following a minimum number of column accounts, the accuracy of our model improves significantly.

Because we only use the accounts that users follow to make predictions, it’s difficult to accurately predict who a user is more likely to follow at an individual level. For example, it’s possible that one specific user doesn’t follow any of the top 102 column accounts we use to make predictions. If this were the case, our model would not be able to predict anything about that specific user. An example is provided below:

Twitteruser123 doesn’t follow any of the top 102 accounts, so our model doesn’t have any information to predict.

There are, however, many users that follow a significant number of column accounts. We are able to accurately predict who these users are most likely to follow. Another example below:

Kamala Harris follows 18 top accounts, which shows that if users follow enough column accounts, our model is accurate at an individual level.

As you can see, our model is very accurate when predicting who Kamala Harris follows, but fails to make an accurate prediction for twitteruser123. This illustrates the fact that at an individual level, our model is not accurate if users don’t follow enough column accounts. However, at the aggregate level (combining all predictions at the individual level), our model is fairly accurate (72%). But, because users that don’t follow enough column accounts are included in our aggregate predictions, the accuracy suffers.

As you can see below, as the minimum number of column accounts followed per user increases, the accuracy of our model also increases. The number in the center of each bar is the total number of users included in the model.

As users are required to follow more column accounts, the accuracy of our model continues to get higher. The number in the center of each bar is the total number of users satisfying the minimum requirement. Our model’s accuracy maxes out at 99.9% when users must follow at least 30 column accounts.

The bar on the far left represents the accuracy of our model when we include every user without taking into account how many column accounts they follow. The second bar from the left represents the accuracy of our model when we only include users that follow at least one of the column accounts. Each subsequent bar has a higher minimum following requirement, along with a lower total number of users. As you can see, the accuracy increases as the minimum following requirement gets higher. When we require that users follow at least 30 column accounts, our model is 99.9% accurate. However, there are only 42 users that follow at least 30 column accounts. We decided to include all users regardless of the number of column accounts they follow, even at the cost of accuracy, in order to avoid biases in our data. It’s possible that users following Trump are just less likely to follow other accounts, so by implementing a minimum following requirement, the ratio of Biden followers to Trump followers would become unbalanced.
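
The bars described above could be produced by a loop like the one sketched here. It assumes the model’s predictions are simply re-scored on the subset of users meeting each minimum (the article does not say whether the model was retrained for each threshold), and the function name `accuracy_by_min_follows` and the toy data are hypothetical.

```python
def accuracy_by_min_follows(matrix, y_true, y_pred, thresholds):
    """For each minimum k, score only users following at least k column accounts."""
    results = []
    for k in thresholds:
        keep = [i for i, row in enumerate(matrix) if sum(row) >= k]
        if not keep:
            results.append((k, 0, None))  # nobody meets this requirement
            continue
        correct = sum(1 for i in keep if y_true[i] == y_pred[i])
        results.append((k, len(keep), correct / len(keep)))
    return results

# Toy data: users who follow more column accounts are predicted more reliably.
matrix = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
y_true = [1, 0, 1, 0]
y_pred = [0, 1, 1, 0]
results = accuracy_by_min_follows(matrix, y_true, y_pred, thresholds=[0, 1, 2, 3])
```

Each tuple holds the minimum requirement, the number of users that satisfy it, and the accuracy on those users, which mirrors the bar height and the number printed inside each bar.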

Discussion:

In our project we were able to examine and analyze many trends and relationships between Trump and Biden’s Twitter followers and the column accounts. We discuss the implications and meaning of our analysis.

General Twitter Users:

Using our model (72% accuracy), we were able to make reasonably accurate predictions about whether a user is more likely to follow Trump or Biden on Twitter. Users that follow column accounts with lower WAEs are easier to classify as followers of one candidate or the other. According to our WAE distribution, users that follow Kamala Harris, a left-leaning account, are more likely to follow Biden, while users that follow Sidney Powell, a right-leaning account, are more likely to follow Trump. These accounts, and the people running them, are considered polarizing due to their political orientation, and according to our analysis, they are effective indicators in determining whether a user is more likely to follow Trump or Biden.

Can we accurately predict the political orientation of users based on if they’re more likely to follow Biden or Trump? No! It’s interesting to hypothesize about users’ political orientations, but the scope of our project is only focused on the world of Twitter. The results of our project are not meant to be relied upon outside of Twitter analysis. However, we do believe that the results of our project warrant more investigation and open a door to connect users’ political orientation with their Twitter activity. In the future, we would love to see research that connects Twitter activity to political orientation, elaborating on our prediction of who a user is most likely to follow on Twitter.

Popular Twitter Accounts (Column Accounts):

By using our weighted average entropy data and distribution, and by analyzing our data visualizations, we are able to see which accounts are most highly correlated with Biden and Trump’s accounts. As mentioned above, the Kamala Harris and Sidney Powell accounts are excellent indicators for whether a user follows Trump or Biden. Because of this, we can conclude that Kamala Harris’ account is heavily correlated with Joe Biden’s account.

Is it logical to extrapolate the correlations outside the realm of Twitter and into real life? Since Harris is Biden’s vice president, it’s quite obvious that the two people are not only heavily correlated on Twitter, but are also very connected in real life. However, it’s dangerous to conclude that all column accounts with low WAEs are highly correlated with Trump or Biden. For example, Ariana Grande has the fourth lowest entropy, indicating that her account is an excellent indicator of whether a user follows Trump or Biden. Grande’s political preferences have been well-documented, and it’s safe to assume that her account is heavily correlated with Biden’s. But in real life, it seems highly unlikely that Ariana Grande and Joe Biden are well-connected. Therefore, just because many users on Twitter follow both accounts doesn’t necessarily mean that the people/entities running the accounts are connected in real life.

Polarization:

Does our data suggest that there are high levels of polarization between followers of Trump and Biden’s Twitter accounts?

Our entropy analysis shows that certain accounts are better at predicting who a user follows (Trump or Biden) than others. This suggests that there is a degree of divisiveness between Biden and Trump followers for specific accounts like Kamala Harris or Sidney Powell. But is this enough to conclude that Trump and Biden followers are polarized?

Well, yes and no. We find clear evidence of specific interest groups among many Trump and Biden followers (Kamala Harris or Sidney Powell). However, we cannot measure polarization between Trump and Biden followers as a whole; we can only begin to discern the polarizing nature of specific accounts.

We are able to find levels of polarization among column accounts with lower WAEs, thus indicating which specific accounts seem to be the most polarizing between Biden and Trump followers on Twitter. Accounts like Kamala Harris, Sidney Powell, and Barack Obama have very low WAEs, demonstrating their polarizing nature.

Along with the column accounts’ WAEs, our histograms and data visualizations indicate large differences between the proportions of Biden and Trump followers for specific accounts. Below is the normalized histogram of the number of Trump vs. Biden followers (again).

Again, the differences in Trump vs. Biden followers for specific accounts are clear, especially towards the edges of the graph.

Our entropy analysis and data visualization give us reason to believe that certain accounts are more polarizing compared to others. But more data is needed to answer the question of how polarized Trump and Biden followers are. And even more data is needed to answer the question of how polarized Trump and Biden followers are in real life. Our project only indicates which specific accounts have high levels of polarization.

Why Does This Matter?

If you haven’t found your own answer to the question above, here’s how we view the implications of our analysis.

First, our analysis shows that users following Trump are much less likely to follow accounts that are not correlated with Trump than users following Biden are to follow accounts not correlated with Biden. This suggests that Trump followers on Twitter are less likely to seek information from viewpoints different from their own. As people interact with like-minded others, their political views and attitudes become more extreme, which has dangerous implications for democracy (contentious elections, deadlock in Congress, etc.) (Kuta). We hope that our analysis spreads awareness of the prevalence of following like-minded accounts.

Second, our analysis has the ability to give certain accounts a better idea of the type of people that follow them. For example, by looking at our data, BBC would realize that most of its followers are users that follow accounts correlated with Biden.

Third, our analysis gives Twitter an opportunity to make their product even more appealing and provide more opportunities for users to create larger networks. For example, Twitter could help users find more accounts that they would be interested in by recommending accounts correlated with who the users already follow. This would increase activity levels on Twitter and would benefit both Twitter and its users.

Future Directions:

There are improvements that can be made in order to increase the accuracy of our model and ultimately strengthen our analysis. First and foremost, analyzing a larger set of data (i.e. scraping more followers) would yield more accurate results. Second, using not only the accounts that users follow as data points, but also using tweets, likes, comments, and other types of Twitter activity would not only yield more accurate results, but it would also allow us to make even more detailed and nuanced conclusions.

We hope that our project opens doors for more research to be done on the relationship between a person’s Twitter activity and their political orientation. We wonder if it will ever be possible to accurately predict a person’s political orientation by analyzing Twitter activity.

We also hope that more Twitter analysis projects like ours will be done in the future. We encourage others to use our code to compare Twitter users and branch out into other areas of focus.

Code Repository

If you are at all interested in investigating more trends related to the ones we’ve uncovered in our project, please visit our GitHub repository!

In the repository, there’s a file (youTry.py) that allows you to test our model yourself. Enter the name of a Twitter user and see what our model predicts! (As mentioned above, our model may be inaccurate at the individual level if the user doesn’t follow enough column accounts!)

Acknowledgements

Thanks to Lucian Leahu, our professor, who helped us with the direction of our project and taught us so much about the field of big data. Big thanks to the creators and contributors of the incredibly useful Twint library. As always, thanks to StackOverflow and the open source community for providing solutions to almost every problem. And lastly, thank YOU for reading about our project and for inspiring us to continue telling stories with data!

Citations:

[1] Albert-László Barabási, Network science (2013), royalsocietypublishing.org/doi/full/10.1098/rsta.2012.0375

[2] Amit Kumar and Srinivas Peeta, Entropy Weighted Average Method for the Determination of a Single Representative Path Flow Solution for the Static User Equilibrium Traffic Assignment Problem (2014), www.sciencedirect.com/science/article/abs/pii/S019126151400191X

[3] Sarah Kuta, Talking with like-minded people creates extreme political views, CU Boulder research finds (2016), https://www.dailycamera.com/2016/09/30/talking-with-like-minded-people-creates-extreme-political-views-cu-boulder-research-finds/

Math and computer science student @ Carleton College. Passionate about using data to reach meaningful conclusions.