During my recent NDC Sydney talk on real-time Twitter analysis with Reactive Extensions, I talked about the approach I used to track current discussion topics as they changed over time. This is similar to Twitter’s trending topics, but changing more dynamically.
The source data came from Twitter traffic during two episodes of the ABC's Q&A show in the lead-up to Australia's 2016 federal election. Each of the candidates for Prime Minister – incumbent Malcolm Turnbull and opposition leader Bill Shorten – appeared as a solo guest to face questions from the audience.
I wanted a live view of the current topics of discussion as the show progressed, to get a feel for which topics the Twitter audience was responding to.
Streaming real-time word mentions
I needed a stream of all the words people were tweeting. Once I had this, I’d be able to gather stats on which words were getting the most mentions. This is quite a simple task at its core, as it just involves taking each tweet off the stream and splitting its content text into words.
The only catch was that I had to filter out all the filler words in English that will inevitably be the most mentioned – I, you, me, a, the and so on. It turns out there are lists of stopwords that can be used as a starting point for filtering.
The other issue was one of data quality. The better the string splitting handled things like punctuation, URLs, parentheses and so on, the better the results. My approach was pretty basic: it didn't handle merging similar words together (like plural forms), or looking for multi-word topics, like "climate change" instead of "climate" and "change".
With that in mind, the code to build a stream of word mentions comes down to two lines:
tweets.flatMap(tweet => tweet.text.split(" "))
      .filter(word => isValid(word))
Splitting on spaces can be replaced by a regex that splits on spaces, punctuation, etc. But you can end up with slices of a URL when splitting on dots, for example. As I said, the better you do this, the better the results – up to a point.
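To make that concrete, here's a rough sketch of the kind of filter and split I mean. The stopword list is just a token sample, and the regex is one of many possible choices – treat the details as illustrative rather than the exact code from the talk:

// Illustrative only – a real stopword list runs to a few hundred entries.
const stopwords = new Set(["i", "you", "me", "a", "the", "and", "to", "of", "is", "it"]);

// A word is worth counting if it isn't trivially short, isn't a url fragment, and isn't a stopword.
const isValid = word =>
    word.length > 2 &&
    !word.startsWith("http") &&
    !stopwords.has(word);

const words = tweets
    .flatMap(tweet => tweet.text.toLowerCase().split(/[\s,;:!?"()]+/))  // split on whitespace and light punctuation
    .filter(word => isValid(word));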
Keeping track of current popular topics
The next step is to take this stream of word mentions and identify which words are the most frequent. I wanted to get progressive snapshots of the current popular topics over time, which requires looking at the words stream in windows of time.
I chose windows 5 minutes long. This is based loosely on how long topics on Q&A tend to last, as well as the average human attention span on Twitter. It could have been a shorter duration, but I didn't want the snapshots tied too tightly to any one moment in time. If you apply this approach to other situations, it's something to consider.
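In Rx terms, that just means slicing the words stream into 5-minute windows. A minimal sketch, assuming RxJS 5-style chained operators (windowTime is my assumption about the operator name; older versions call it windowWithTime):

// Each value emitted is itself an observable covering one 5-minute slice of the words stream.
const wordWindows = words.windowTime(5 * 60 * 1000);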
Aggregating word counts – groupBy vs reduce
I thought of a couple of different ways to take a time window of words and generate the word frequency details. The first was to take the window and build up the counts in a dictionary/hashmap as each word came in. The second was to group the window by each word, and extract the count for each word.
Here's what the groupBy approach would look like:
words.groupBy(word => word)
     .flatMap(mentions => mentions.count()
         .map(c => ({ word: mentions.key, count: c }))
     )
     . <something goes here>
My first impression was that the groupBy approach was better, mostly because it splits the source stream into one stream per unique word. This means you can just use count() on each unique word's stream. But I had two problems with this approach. The first was that it wasn't obvious to me how to join all these counts back together into a value that summarised the last 5 minutes as a whole. This is where the "something goes here" above needs filling in.
The second issue I had with groupBy was that with so many unique words floating around, it didn't seem like a great idea to split the stream into several thousand smaller streams, extract the count, and somehow glue all these thousands of streams back together again. It seemed overly complex.
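For what it's worth, one plausible way to fill the gap – assuming words here stands for a single window that completes – is to gather the per-word counts with toArray() once the window closes. A sketch, not the code I actually used:

// count() on each per-word stream can only emit once the source completes,
// so this only works on a closed window of words, not the live stream.
words.groupBy(word => word)
     .flatMap(mentions => mentions.count()
         .map(c => ({ word: mentions.key, count: c }))
     )
     .toArray()   // one array of { word, count } pairs per window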
So I went with an approach similar to how I'd do it manually – keep a running tally, and update it as each new word comes in. This is certainly a lot simpler to think about, and I'm pretty sure it's a lot less work for the CPU. But mutating a stateful object as you go is a very non-functional approach. Of course, you could build a whole new hashmap with the latest tally for each new word, but then the garbage collector would be screaming.
This was a less pure approach that worked and was simpler (at least for me to think about). Here’s what it looks like:
words.reduce(updateCounts, { })
As you can see, it's a lot simpler. I've tucked the code that does the running tally into a method – updateCounts. All updateCounts does is take each new word from the stream, make sure it's in the hashmap, and update its count. The initial empty map is passed in as the second argument.
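updateCounts itself is only a few lines. A sketch of it (my actual version may differ in small details):

// Takes the running tally and the next word, and returns the updated tally.
const updateCounts = (counts, word) => {
    counts[word] = (counts[word] || 0) + 1;
    return counts;
};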
At the end of each 5-minute window, the hashmap holds the tally for every word seen in that window, which is easy to plot as a histogram.
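Putting the windowing and the reduce together, the whole tally pipeline looks something like this – again a sketch assuming RxJS 5-style operators, with a sort bolted on to make the histogram (and the later "top topics" step) easier:

// Emits one sorted array of { word, count } pairs at the end of every 5-minute window.
const topics = words
    .windowTime(5 * 60 * 1000)
    .flatMap(windowOfWords => windowOfWords.reduce(updateCounts, {}))
    .map(counts => Object.keys(counts)
        .map(word => ({ word, count: counts[word] }))
        .sort((a, b) => b.count - a.count));

This is essentially the stream that gets wrapped up as getTopics() further down.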
Show me the money
That’s a lot of talking about code, but here’s a look at what it actually produces. I think it’s quite cool.
This shows topics moving into focus, growing to their peak, and fading away. You can definitely get a feel for how some topics burst onto the scene, whereas others make only minor ripples. There are several distinct themes through the episode:
- The gay marriage plebiscite
- Medicare
- Tax – especially the proposed small business tax cuts, along with trickle down economics
- Suicide and mental health
- Manus Island, refugees, asylum seekers
- The NBN – by far the most mentioned word
So what are the hottest topics?
Surely to find out the hottest topics, it’s just a case of picking the words that have the highest mention rate, or most total mentions? Well, it’s not that simple.
Looking at the question from the Manus Island detainee during Malcolm Turnbull’s Q&A appearance, there are a lot of words that jump into the conversation – Manus, asylum, people, detention, boats, refugees, etc. The discussion is split across many words, but they’re all on the same topic. To find out which topics have the biggest impact, a better way is to look for spikes in the conversation, and then work out what people are talking about at the time.
I discussed spike detection in a previous post. Finding the topics during spikes is just a matter of sampling the latest value of discussion topics at the moments when spikes are happening. If we pull the discussion topics code above and spike detection code from my other post into methods, it boils down to an approach like this:
getSpikes().withLatestFrom(getTopics(), (spike, topics) => topics.slice(0, 3))
Every time there's a spike, this code takes the top three most popular topics from our topics stream (assuming they're already sorted). I found that taking all topics within a certain percentage, say 70%, of the most popular topic worked fairly well.
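That variation is only a small change to the selector. A sketch of the 70% version, assuming topics is the sorted array of { word, count } pairs from above:

// Keep every topic whose count is within 70% of the most-mentioned topic at the moment of the spike.
const highlights = getSpikes()
    .withLatestFrom(getTopics(), (spike, topics) =>
        topics.filter(topic => topic.count >= topics[0].count * 0.7));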
An automated, real-time summary generator
What this generates is a set of highlights from the Twitter data, in real-time. Each time there’s a spike, we get a snapshot of the discussion at that point in time:
We can see a diverse range of topics making spikes in traffic for Malcolm Turnbull’s appearance on Q&A. If we wanted to get fewer highlights, all we’d need to do is tighten up the criteria for how big the spike needs to be before we notice it.
Comparing audience reactions to Turnbull vs Shorten
Interestingly, Bill Shorten's episode had a very different shape of Twitter discussion. During Shorten's episode, Tony Jones – the host – interrupted frequently. Presumably the ABC didn't want the solo-guest format to turn into a free, hour-long campaigning opportunity, so they tried to keep a tight rein on things.
But the interplay between Bill Shorten and Tony Jones was so fractious that it turned out to be (almost) the only thing the audience really reacted to:
This looks bad for Shorten. The last thing he and his campaign managers would have wanted out of his appearance would have been for everyone watching to be talking about Tony Jones.
Building blocks
By building on a couple of pieces of rawer analysis, it's possible to extract deeper, more complex insight without needing more complex code to do it. There's a lot of potential in these kinds of approaches for spotting patterns in real-time data and exploring, exploiting, or alerting someone about them.