NDC Sydney talk

In August this year, I gave a talk at NDC Sydney on Real-time Twitter Analysis with Reactive Extensions. NDC is the Norwegian Developers Conference, so it’s a natural progression for them to come to Sydney. This was their first time down under, but they’ve already announced they’ll be back in August 2017.

It was a three-day conference with over 100 speakers, some international and some local. That’s a lot of speakers, and it translates into 7 parallel tracks, or 7 concurrent talks.

Full talk video and code online

The video of my talk is on NDC’s Vimeo channel, and the code and data driving the visualizations in the talk are on GitHub. The repository is fairly large because the data files total a couple of hundred megabytes.

I’ve written a few articles covering parts of the material in the talk and discussing the code approach.

My aim: showing Rx can make dry data interesting

I wanted to demonstrate how effective the Reactive Extensions framework can be with real-time data. I also wanted to inspire people to look into what they can make of their own real-time data. Most systems have real-time data – more than you’d think. It’s just not often a focal point for analysis, investigation or feature development. Getting deeper, real-time analysis out of your data can make it suddenly a lot more valuable.

So I wanted to show that even dry, seemingly boring data can yield interesting insights when it’s manipulated and analysed in the moment, with an added perspective of time. And I wanted to show how capable Rx is in this area.

The data: real-time political Twitter streams

The data I had was pretty dry – Twitter discussion about politics during various debates, panel shows, convention speeches, and the occasional national exit from the EU. The real-time analysis of this data made the politics itself a lot more interesting to me than it would have otherwise been.

It made me realise that there’s huge power in understanding, playing to and even manipulating the audience reaction. I also realised that if I could do this at home, there must already be companies out there doing the same thing on a much larger scale.

All my data came from Twitter. It’s an easy medium for people to understand and its real-time, streaming nature is baked into its identity. The API gives you a good handle on the data, so it’s not hard to stream the data into an analytics engine.

I wanted to be able to take a look at some of the major political events of 2016 – Australia’s federal election, the EU referendum, and the US presidential election. So I recorded (a lot of) streaming data from Twitter during various events, and set up my code so that I could replay the tweets as if they were coming through live.

I ended up with data from:

  • Countless episodes of the ABC’s Q&A program in the lead-up to the Australian federal election of 2016
  • Many debates and TV appearances from prominent figures in the UK’s EU referendum, as well as polling day itself
  • The major speeches from the Democratic National Convention in the USA, where Hillary Clinton became the first female nominee for president, as well as many states’ polling days in the primaries

Technical “issues”

The Twitter API is great, and you can get a lot for free. But you can’t always get everything. My code streamed any tweets using a set of hashtags, which is an easy way to tap into the world’s thoughts on a particular topic. For Q&A, I’d stream for #qanda and #auspol. For the EU referendum and US presidential race I had to bundle a bunch of hashtags and politicians’ names together into the stream.

The problem wasn’t getting enough data, it was getting too much data. This problem came in two shapes.

Millions of tweets take a lot of space

Firstly, during major events with big audiences, like the EU referendum and the Democratic Convention, I would get millions of tweets per day. Storing all this data for replay later takes gigabytes. Gigabytes are cheap, but not when you want to load them up in JavaScript in a browser. Then gigabytes are expensive, and tend to crash your page.

For the demos in the talk I had to tightly trim down the time ranges I was looking at – especially for the Democratic Convention, which went on for hours a day over several days. I also had to extract a one in four sample to cut the file sizes down to something manageable. This isn’t something you’d have to do if you were actually analysing this data in real-time, but it certainly threw in challenges for the data replay.
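As a rough illustration of that trimming and sampling step (not the code from the talk), here is a minimal TypeScript sketch. The `RecordedTweet` shape and the `trimAndSample` name are assumptions made for the example, and "one in four" is taken to mean keeping every fourth tweet rather than sampling randomly.

```typescript
// Hypothetical shape for a recorded tweet; the real recordings may differ.
interface RecordedTweet {
  timestampMs: number; // when the tweet was posted, in milliseconds
  text: string;
}

// Keep only tweets inside [startMs, endMs), then keep every n-th survivor.
function trimAndSample(
  tweets: RecordedTweet[],
  startMs: number,
  endMs: number,
  n: number
): RecordedTweet[] {
  return tweets
    .filter((t) => t.timestampMs >= startMs && t.timestampMs < endMs)
    .filter((_, index) => index % n === 0);
}

// Usage: 20 synthetic tweets one second apart, trimmed to the first
// 16 seconds and then sampled one in four.
const recorded: RecordedTweet[] = [];
for (let i = 0; i < 20; i++) {
  recorded.push({ timestampMs: i * 1000, text: "tweet " + i });
}
const sampled = trimAndSample(recorded, 0, 16000, 4);
console.log(sampled.map((t) => t.text)); // tweets 0, 4, 8 and 12
```

Taking every n-th tweet keeps the shape of the traffic over time roughly intact, which matters when the point of the demo is the spikes.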

Maxing out Twitter’s streaming cap

The second problem was that Twitter’s happy to give you all tweets on your search topics for free, up to a point. If you’re getting too much traffic, they cap your tweet rate and give you a sample. I don’t know whether this is to protect their bandwidth usage, server utilisation, or just plain commercial interest.

The free API cap is around 3,000 tweets per minute, which is a lot of tweets. However, during the EU referendum, the tweet rate topped out at about 12,000 tweets per minute. So at that point, I would have only been getting a 25% sample. Major events in the USA are on another scale again.
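The 25% figure is just the cap divided by the real rate. A tiny sketch of that arithmetic follows; the function name is made up for the example, and the 3,000 figure is the approximate cap observed above rather than an official limit.

```typescript
// Fraction of matching tweets delivered when the stream is capped.
// capPerMinute is an observed, approximate cap, not an official number.
function deliveredFraction(actualPerMinute: number, capPerMinute: number): number {
  if (actualPerMinute <= 0) {
    return 1; // nothing to cap
  }
  return Math.min(1, capPerMinute / actualPerMinute);
}

console.log(deliveredFraction(12000, 3000)); // 0.25, the EU referendum peak
console.log(deliveredFraction(1000, 3000)); // 1, a stream that stays under the cap
```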

A major part of my talk was around analysis of tweet rate over time, picking out spikes in traffic, and analysing the causes of those spikes, as you can see in this chart:

[Chart: tweet rate over time with traffic spikes (turnbullspikes-1200)]

None of this is possible when the stream rate is flat-lined at 3,000 tpm.

So it turned out the Q&A streams were really valuable. Thankfully there are a lot fewer Australians than British, Europeans or Americans, and even fewer Australians who watch and tweet about Q&A. A typical Q&A episode would range between 500 tpm and 1,000 tpm. This is well within Twitter’s cap, so I got all the data and all the spikes were clearly visible.
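To make the tweet-rate idea concrete, here is a minimal sketch that buckets tweet timestamps into per-minute counts and flags minutes that jump well above the recent average. The spike rule (more than `factor` times the mean of the previous `lookback` minutes) is an assumption chosen for illustration, not the analysis used in the talk.

```typescript
// Count tweets in each one-minute bucket, given timestamps in milliseconds.
function tweetsPerMinute(timestampsMs: number[]): number[] {
  if (timestampsMs.length === 0) {
    return [];
  }
  let first = timestampsMs[0];
  let last = timestampsMs[0];
  for (const t of timestampsMs) {
    if (t < first) first = t;
    if (t > last) last = t;
  }
  const firstMinute = Math.floor(first / 60000);
  const lastMinute = Math.floor(last / 60000);
  const counts: number[] = [];
  for (let m = firstMinute; m <= lastMinute; m++) {
    counts.push(0);
  }
  for (const t of timestampsMs) {
    counts[Math.floor(t / 60000) - firstMinute] += 1;
  }
  return counts;
}

// Return the indexes of minutes whose count exceeds `factor` times the
// average of the preceding `lookback` minutes.
function findSpikes(counts: number[], lookback: number, factor: number): number[] {
  const spikes: number[] = [];
  for (let i = lookback; i < counts.length; i++) {
    let sum = 0;
    for (let j = i - lookback; j < i; j++) {
      sum += counts[j];
    }
    const baseline = sum / lookback;
    if (baseline > 0 && counts[i] > factor * baseline) {
      spikes.push(i);
    }
  }
  return spikes;
}

// An uncapped stream shows its spike; a stream flat-lined at the cap shows nothing.
console.log(findSpikes([600, 620, 610, 1400, 650, 630], 3, 1.5)); // [3]
console.log(findSpikes([3000, 3000, 3000, 3000, 3000, 3000], 3, 1.5)); // []
```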

You can buy full access to everything, either the firehose stream or pulling historical data through Gnip, but bring your credit card. There’s no quoted price: you tell them what you want to do and who you are (i.e. how deep your pockets are) and they decide how much it’ll cost. Presumably as a non-trivial fraction of your pocket depth.

Processing 1 million words per second

The great thing about having data to replay is that you don’t have to replay it in real time. Instead, I replayed the data at around 300 times real-time speed. I’ve talked about the technique for this in an earlier post.
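The earlier post covers the actual technique. As a generic sketch of the idea only, time compression comes down to dividing each tweet’s offset from the start of the recording by the speed multiplier. The `TimedTweet` shape and `replayDelays` name are hypothetical, and the tweets are assumed to be sorted by timestamp.

```typescript
// Hypothetical shape for a recorded tweet.
interface TimedTweet {
  timestampMs: number; // original posting time, in milliseconds
  text: string;
}

// For each tweet, the delay in milliseconds from the start of the replay at
// which it should be emitted. Assumes tweets are sorted by timestamp.
function replayDelays(tweets: TimedTweet[], speed: number): number[] {
  if (speed <= 0) {
    throw new Error("speed must be positive");
  }
  if (tweets.length === 0) {
    return [];
  }
  const start = tweets[0].timestampMs;
  return tweets.map((t) => (t.timestampMs - start) / speed);
}

// Tweets recorded at 0, 1 and 5 minutes, replayed at 300 times real time.
const delays = replayDelays(
  [
    { timestampMs: 0, text: "first" },
    { timestampMs: 60000, text: "second" },
    { timestampMs: 300000, text: "third" },
  ],
  300
);
console.log(delays); // [0, 200, 1000]
```

Each delay could then be handed to a timer or a scheduler, so five minutes of recorded discussion plays back in one second.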

This is great for keeping the demos short and snappy, but it can push the browser pretty hard. This is especially true if you’re doing some heavyweight calculations or memory allocations as the stream flows. Some of the analysis involved generating word count histograms of current discussion topics, which looks like this:

The histogram code used sliding windows of 5 minutes, 5 seconds apart. This means the code is actually processing 60 concurrent windows of tweets. There were roughly 10,000 words per minute flowing through the tweet stream, and that data was being pushed through at 300 times the normal rate.

This gives a theoretical peak data processing rate of:

60 windows x 10,000 words x 300 replay multiplier
= 180,000,000 words per real-world minute
= 3,000,000 words per real-world second
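The same arithmetic as a small function, in case you want to plug in your own window sizes and replay speeds. The function and parameter names are made up for the example.

```typescript
// Theoretical peak processing rate, in words per real-world second.
function peakWordsPerSecond(
  windowSeconds: number, // length of each sliding window
  hopSeconds: number, // gap between the starts of consecutive windows
  wordsPerMinute: number, // words flowing through the stream at normal speed
  replaySpeed: number // replay multiplier
): number {
  const concurrentWindows = windowSeconds / hopSeconds;
  return (concurrentWindows * wordsPerMinute * replaySpeed) / 60;
}

// 5-minute windows, 5 seconds apart, 10,000 words per minute, 300x replay.
console.log(peakWordsPerSecond(300, 5, 10000, 300)); // 3000000
```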

Trying to achieve a throughput of 3 million words per second tended to make the browser sad and it struggled to maintain the pace.

In reality the browser tended to max out around 1 million words per second.

I was pretty impressed by that, given that I didn’t spend any time improving performance. With time constraints and the unrealistic scenario of replaying the data 300 times faster than normal, I left it as it was. But it’s a nice demonstration of what this technique can do for load testing.
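For a feel of what the sliding-window word counts involve, here is a plain TypeScript sketch without Rx. It is not the code from the talk: the names are hypothetical, the tokenizer is a crude whitespace split, and windows are aligned to time zero. Every tweet is added to each window that covers it, which is where the 60 concurrent windows come from.

```typescript
// Hypothetical input shape: text with a timestamp in milliseconds
// measured from the start of the recording.
interface TimedText {
  timestampMs: number;
  text: string;
}

interface WindowCounts {
  startMs: number;
  counts: Record<string, number>;
}

// Count words in every window of length windowMs, with window starts hopMs apart.
function slidingWordCounts(items: TimedText[], windowMs: number, hopMs: number): WindowCounts[] {
  // Prototype-free objects so words like "constructor" count correctly.
  const byStart: Record<number, Record<string, number>> = Object.create(null);
  for (const item of items) {
    const words = item.text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
    // Window k covers [k * hopMs, k * hopMs + windowMs), so the item belongs
    // to every k with timestampMs - windowMs < k * hopMs <= timestampMs.
    const lastK = Math.floor(item.timestampMs / hopMs);
    const firstK = Math.max(0, Math.floor((item.timestampMs - windowMs) / hopMs) + 1);
    for (let k = firstK; k <= lastK; k++) {
      const start = k * hopMs;
      const counts = byStart[start] || (byStart[start] = Object.create(null));
      for (const w of words) {
        counts[w] = (counts[w] || 0) + 1;
      }
    }
  }
  return Object.keys(byStart)
    .map((key) => Number(key))
    .sort((a, b) => a - b)
    .map((startMs) => ({ startMs: startMs, counts: byStart[startMs] }));
}

// Small demo: 10-second windows, 5 seconds apart.
const windows = slidingWordCounts(
  [
    { timestampMs: 0, text: "budget budget tax" },
    { timestampMs: 6000, text: "tax cuts" },
    { timestampMs: 12000, text: "budget" },
  ],
  10000,
  5000
);
console.log(windows.length); // 3 windows, starting at 0, 5000 and 10000
console.log(windows[0].counts["tax"]); // 2

// With 5-minute windows 5 seconds apart, one tweet lands in 60 windows.
console.log(slidingWordCounts([{ timestampMs: 600000, text: "hello" }], 300000, 5000).length); // 60
```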

Practice practice practice

Sad to say, but I totally ran out of time for rehearsing my talk. I did get some time for rehearsals, but not as much as I wanted. The reality was that the content was just not completed early enough to get a proper amount of rehearsals in. But such is life.

I kept a rough worklog for interest’s sake, to see where I had spent my time. In total, I estimate I spent between 150 and 200 hours on the talk. That didn’t sound huge to me at first, until I realised it’s 4-5 full-time weeks of work.

The rough breakdown of work time was:

  • 10% – Building recording apparatus
  • 10% – Gathering source data
  • 50% – Analysis and visualizations
  • 25% – Creating slides and talk content
  • 5% – Rehearsal

Building the visualizations took a long time. Many people give talks about a system they’ve built or an experience they had. From its genesis, my talk was about a system I knew could be built and thought could be cool, but which didn’t actually exist at the time. So I was kind of fighting a war on two fronts, which historically tends to go well.

My experience at NDC

The conference was great. I met a lot of interesting people, some I had never heard of before, and some I had been following for more than ten years, like Jon Skeet and Tess Ferrandez. And the crème brûlée was very good.

The talks were interesting. Jon Skeet’s talk on C# 7 was probably my pick, and I enjoyed the brief discussion he had about tail call recursion in the CLR. The huge number of speakers meant you could choose some different topics and hear about something new, to an extent. It was different to YOW, which I really enjoyed because I knew nothing about almost every talk I went to. The comedy spot from James Mickens was Michelin star worthy, haute developer humour.

Also, kudos to Jon Skeet for calling out the poor diversity at the conference. It’s something that’s not simple and easy to solve – which is why it hasn’t been solved already, and why we should put determined effort into doing so. It’s not the responsibility of one or a few people, or of a few or many conferences – it needs a concerted effort from the community as a whole to make a change for the better.

All in all, it was a great experience for me to attend the conference, and to speak. It took a lot of work, but it was worth it. Thankfully, I can stop repeating Damian Conway’s Instantly Better Presentations, at least for a while.

Sad news

Finishing this post on a sad note feels weird. I wrote most of this a while ago and just came back to finish it up. But just yesterday I found out that Leslie Nassar has died. He was the developer and driving force behind the ABC’s Q&A Twitter integration and engagement (among other things).

It’s fair to say that without his work on Q&A, my talk wouldn’t have happened. I only met him once and so can’t say I knew him well. But I can say thanks for his work, and best wishes to his family. To lose a father of three young kids in this way must be very hard.
