I recently discovered Pew Research’s extensive analysis from 2021, in which they surveyed over 10,000 Americans about their political and social views. Clustering people based on their answers across all these issues reveals nine orientations that represent loose coalitions of people on the left and the right in the United States, plus a small but growing group of “stressed sideliners” in the middle.
Pew ran similar studies in 2017 and 2014, getting similar results, and has data going back to 1987. The Pew “political typology” is thus a stable and robust way to categorize people. I decided to use it to decode content on social media into a plot of how much each person aligns with each group. For example:
This shows that most Biden (AKA @POTUS in 2023) tweets align with topics and sentiment shared by the Progressive Left, and to a lesser extent the Establishment Liberals, Democratic Mainstays, and Outsider Left – as defined by Pew’s research. In contrast, Sean Hannity’s tweets align with all of the other groups. This analysis deals exclusively with Twitter in 2023, but I will apply it to Spoutible as soon as their data is available via an API. For any dataset, it can reveal to what extent a person is living in an echo chamber.
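One simple way to turn a nine-group alignment profile into an echo-chamber number is to measure how concentrated the profile is. Here’s a minimal sketch of one possible metric (normalized entropy over the alignment scores); this is an illustrative approach, not the exact formula behind the plots, and the profiles below are placeholder numbers:

```python
import math

def echo_chamber_score(alignment):
    """Return 1 minus the normalized entropy of an account's
    nine-group alignment scores: 0.0 means alignment is spread
    evenly across all groups; values near 1.0 mean nearly all
    alignment sits in a single group (an echo chamber)."""
    total = sum(alignment.values())
    probs = [v / total for v in alignment.values() if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1 - entropy / math.log(len(alignment))

# Placeholder alignment profiles (illustrative numbers, not model output)
broad = {g: 1.0 for g in range(9)}                         # evenly spread
narrow = {g: (9.0 if g == 0 else 0.01) for g in range(9)}  # one dominant group
print(echo_chamber_score(broad))   # near 0
print(echo_chamber_score(narrow))  # near 1
```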
How it works
Scraping: First, I ran a Twitter network-mapping algorithm (a Python program) that recursively explores a social network, branching out from a few seed accounts. To understand what conservatives talk about, I started with accounts like FoxNews, seanhannity, OANN, and NEWSMAX. For influencers from the center-right (including some never-Trumpers), you would start with davidfrum, BillKristol, BulwarkOnline, ProjectLincoln, and the like. I scraped networks all along the nine-group spectrum, trying to capture the influencers for each of the Pew groups. This resulted in about 1,000 influencers and 2 million tweets of training data.
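The recursive exploration is essentially a breadth-first crawl outward from the seed accounts. A minimal sketch, where `get_following` is a placeholder for whatever Twitter API call returns an account’s neighbors (the real scraper, with its authentication and rate-limit handling, is not shown):

```python
from collections import deque

def map_network(seeds, get_following, max_accounts=1000, max_depth=2):
    """Breadth-first expansion of a social network from seed accounts.
    `get_following` maps a handle to the handles it connects to."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue and len(seen) < max_accounts:
        handle, depth = queue.popleft()
        if depth >= max_depth:
            continue  # don't branch beyond the depth limit
        for neighbor in get_following(handle):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
                if len(seen) >= max_accounts:
                    break
    return seen

# Toy graph standing in for the live API
graph = {"FoxNews": ["seanhannity", "OANN"], "seanhannity": ["NEWSMAX"],
         "OANN": [], "NEWSMAX": []}
network = map_network(["FoxNews"], lambda h: graph.get(h, []))
```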
Topic modeling: I used embedded topic modeling on all of this content to assign each tweet one of 100 discovered topics. In the final version, I actually did this twice, with two slightly different training sets, and picked the best-fitting topic for each tweet from either set.
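Picking the best topic from the two models can be as simple as comparing each model’s top-topic probability per tweet. A sketch of that selection step, with stub callables standing in for the two trained topic models (the real ones came from embedded topic modeling):

```python
def best_topic(tweet, model_a, model_b):
    """Score a tweet under two independently trained topic models and
    keep whichever model assigns its top topic with higher probability.
    Each model is a callable returning {topic: probability} for a tweet."""
    dist_a, dist_b = model_a(tweet), model_b(tweet)
    top_a = max(dist_a, key=dist_a.get)
    top_b = max(dist_b, key=dist_b.get)
    if dist_a[top_a] >= dist_b[top_b]:
        return ("A", top_a, dist_a[top_a])
    return ("B", top_b, dist_b[top_b])

# Stub models for illustration (topics and probabilities are made up)
model_a = lambda t: {"immigration": 0.6, "economy": 0.4}
model_b = lambda t: {"abortion": 0.9, "media": 0.1}
print(best_topic("some tweet", model_a, model_b))  # -> ('B', 'abortion', 0.9)
```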
Sentiment analysis: There are many trained models for “reading” text and deciding whether it is positive, neutral, or negative. However, not all of them are good. After some trial and error, I ended up combining the “advice” of five sentiment models, plus a party-affiliation classifier, to decide whether a tweet was positive or negative:
- Flair – a solid TextClassifier for POSITIVE or NEGATIVE statements
- TextBlob – another sentiment tool that scores on polarity and subjectivity. Turns out the most polarizing tweets are also highly subjective.
- Vader from NLTK – a rule-based sentiment analyzer that handles complex negation (e.g. “I find your lack of faith disturbing”). Politics is full of these sorts of sentences.
- Huggingface model for political sentiment, trained on US senators’ tweets (cardiffnlp/xlm-twitter-politics-sentiment)
- Certainty of the statement (a python sentence-level certainty_estimator, usually for academic texts)
- Republican or Democrat tweets (from huggingface)
In all my past work I’ve relied on just one of these, but Twitter content is full of doublespeak, code-switching, sarcasm, and latent meanings. It required them all. The Republican/Democrat tweet model was particularly bad on a per-tweet level: it said “These pretzels are making me thirsty!” was 99% likely to be from a Republican. But in the aggregate it provides a clear signal about whether an account as a whole leans Republican or Democrat. I ended up tuning a composite model that uses the five sentiment models together, along with topic modeling and the results of the 2021 Pew Political Typology survey, to categorize people along the political spectrum.
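The composite step can be sketched as a simple averaging ensemble. The wrappers around Flair, TextBlob, VADER, and the Huggingface models are omitted here; each is assumed to return a polarity in [-1, 1], and the plain mean is an illustrative stand-in for the tuned weighting:

```python
def ensemble_sentiment(tweet, scorers, threshold=0.0):
    """Average polarity across several sentiment models and map the
    mean to a positive/negative label. `scorers` is a list of
    callables, each taking text and returning a float in [-1, 1]."""
    scores = [score(tweet) for score in scorers]
    mean = sum(scores) / len(scores)
    label = "positive" if mean > threshold else "negative"
    return label, mean

# Stub scorers standing in for the real model wrappers
stubs = [lambda t: 0.8, lambda t: -0.1, lambda t: 0.5]
label, mean = ensemble_sentiment("example tweet", stubs)
print(label)  # positive
```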
Mapping Pew Survey Results: About a third of the 100 discovered topics in the model were political and mapped to one of the Pew survey questions. In these cases, I could infer from the percentage of people in each group who supported a statement whether a positive tweet on that topic would align with that group’s views. For example, Pew finds that 85% of Faith and Flag Conservatives are angry about immigration policy, so for each negative tweet from an account about immigration policy, I can award “85 points” to the Faith and Flag group. Likewise, because 98% of the Progressive Left feel the opposite way, I can award 98 points to positive tweets about immigration policy.
I repeat this idea for each of the 30-40 polarizing topics across the 9 groups, and the result is a plot that roughly gives you an idea of where an account lies on the spectrum.
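This point-awarding step amounts to a lookup table keyed by (topic, sentiment). A minimal sketch using only the two immigration examples above (the real table covers all 30-40 polarizing topics across the nine groups):

```python
# Percent of each Pew group agreeing with a given stance on a topic,
# keyed by (topic, sentiment). Only the two examples from the text
# are shown here.
PEW_POINTS = {
    ("immigration policy", "negative"): {"Faith and Flag Conservatives": 85},
    ("immigration policy", "positive"): {"Progressive Left": 98},
}

def score_account(tweets):
    """tweets: list of (topic, sentiment) pairs for one account.
    Returns accumulated points per Pew group."""
    totals = {}
    for topic, sentiment in tweets:
        for group, points in PEW_POINTS.get((topic, sentiment), {}).items():
            totals[group] = totals.get(group, 0) + points
    return totals

print(score_account([("immigration policy", "negative"),
                     ("immigration policy", "negative"),
                     ("immigration policy", "positive")]))
# -> {'Faith and Flag Conservatives': 170, 'Progressive Left': 98}
```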
Many of the remaining unranked topics were news themes that didn’t have a corresponding Pew question, or typical social media chatter. The most common topic for some accounts was named ‘gonna-lol-sh*t-f*cking-h*ll-gotta-f*ck-dude-wanna-’ (based on its most common and distinctive words). Many of these ignored topics were influencer self-promotion-speak.
Overall it does a reasonable job, and can be a start for others to build tools on. Let’s look at some examples.
Are news media outlets biased?
Let’s look at a few outlets that claim to report straight down the middle.
News outlets are different from individuals in how they use social media. They are dominated by topics that are more about news events than opinion — topics that get ignored in this model because they don’t map to Pew Research’s political questions. Regardless, there’s enough FoxNews and NoSpinZone content to assign them a clear partisan status. Many of the big newspapers appear close to an even split across the political spectrum. But even there, in nearly every case, the views of the Progressive Left and Establishment Liberals appear to be underrepresented.
What does this mean for the predictive model? By itself, this might indicate the model is not very accurate. But when you look at all the different types of Twitter accounts, and how this same model performs across them all, it is better in general than many of the more narrowly trained models on Huggingface. Similar types of accounts give consistent results, showing a similar skew.
How well does it predict views of political pundits?
It captures those on the right very well, but struggles to capture pundits on the left. Look up @ChrisLHayes and @mehdirhasan using the interactive app, and you’ll see they appear to be centrist when they should fall in the Progressive Left. I need to capture more left-wing rhetoric in tweets to tune the model.
I suspect there’s a good reason why left-wing pundits need more content than right-wingers: right-wing social media appears to be far more centralized in who holds the megaphone, and far more consistent in how it frames issues. It uses its own keywords (coded to mean one thing to the base while communicating something else to everyone else) and produces a uniform message. This makes the language particularly easy to topic-model, compared to the more diverse and fractured Left, which has many voices saying different things, using different words, and arguing about the nuances of how to push on similar issues. Pew Research finds a whole group in the middle – the Outsider Left – who vote as Democrats but don’t identify with the Democratic party; they even see it as the enemy, in a sense. So it shouldn’t surprise you that topic modeling the Left’s messages requires more training data.
Predicting which political candidates will appeal beyond their base
There’s a definite shift in how politicians present themselves when they seek national office. Look at Ron DeSantis, for example. The difference between his governor voice and candidate voice is striking. Most of his future presidential campaign rhetoric is aimed at the left, especially the Outsider Left and Democratic Mainstays. It also indicates that he is ignoring the Populist Right and Ambivalent Right, but is strong with Faith and Flag Conservatives. If he succeeds as a candidate, it will be along different lines than Trump.
A useful tool would tell us which candidate is likely to pick up swing voters from which groups. And while this model isn’t sophisticated enough to do that yet, it does hint that such a model might be possible.
What worked well
- Recursively scraping each part of the Pew political typology separately. This allowed me to study each set of topics before combining them. To make this work better, I would train the left and right topic models separately, so I can apply both to tweets and keep the best-fit topic each time. This would likely separate Republican and Democrat tweets better. Like all models, this one sometimes miscategorizes content.
- The tone of tweets on the same topic, like gender or abortion, is often totally different across groups. That difference is the signal of where the major coalitions divide.
- After I realized I was missing moderate-middle influencers in the model, I was able to curate this content and fill out the model. I still need more content from far-right neo-nazi accounts, but these have mostly been banned. Without this content, the model does a poor job of aligning these folks; they are effectively outside the political spectrum as defined in this model.
- There are some “depends on who is the current president” topics. Talk about the economy turns negative when the other party’s guy is president. This means the model won’t stay accurate forever, but it is probably good if trained on data from the same year.
- It eventually proved best to rely heavily on two sets of topics, to ensure left and right tweets are separated cleanly regardless of sentiment.
- I tried many different sentiment approaches; none were great on their own, but the combination seems to work. If I could add more layers to this step, I would add a sarcasm detector and a code-switching detector, if that is possible. Perhaps checking against a list of words that have been used for code-switching would work. That’s a great question to bring to the folks at the Code Switch podcast.
How do you solve a problem like @AOC?
@AOC’s account is a prime example of where I will need a more nuanced model. The current algorithm thinks she leans to the right when she is a bastion of the Progressive Left. I think this is because @AOC often quotes a tweet she is criticizing and recycles its word choices, so the topic model captures that conservative talking point and assigns it to her. This code is too simple-minded: it assumes the tweets a person shares carry the ideas they mean to spread. Looking at @AOC’s tweets, which are rife with sarcasm, wit, and carefully crafted double meanings directly attacking the right, it is easy to see why the model struggles to categorize her correctly.
“Thank you Fox News for making all the campaign graphics I never knew I needed”
“Oh no! They discovered our vast conspiracy to take care of children and save the planet”
If a topic model parses these tweets in a vacuum (without the associated image), it finds folks on the far right saying the exact same things – only about her. The algorithm provides some insight, in that sense:
I think with further tuning I could build a predictor that can tell “on brand” official accounts from authentic individual accounts, based on the frequency and consistency of their messaging. @AOC is kind of the polar opposite of @SpeakerPelosi in that respect.
What doesn’t work well yet
- Fluid topics won’t be tracked and incorporated. Political talking points change constantly, so even with 2 million tweets going back months, some future topics will be missed.
- It doesn’t handle sarcasm and coded language in tweets yet. For that, I’d need a ChatGPT-level model; this one is built on much simpler stuff.
Try it yourself
- Download all results from data.world: https://data.world/marcmaxmeister/twitter-influencers-imputed-pew-political-typologies
- Check out spoutible.guru (where you can toggle summary data on all 962 accounts in the training set)
- Sign up to be notified when this tool is available for Spoutible.com – Twitter’s best alternative in 2023. They haven’t released API access yet, so I can’t study it. From what I’ve seen, it is an echo chamber, but I really want to use this method to quantify the “are you living in an echo chamber?” question when the data is available.