Picking a March Madness bracket using NLU
The NCAA basketball playoffs are here. It's time to make complicated bets on the outcome of a single-elimination tournament. Time for basketball fans to act like they know things about statistics, and statistics fans to act like they know things about basketball, and fans of both basketball and statistics to win stuff. Time for March Madness.
I enjoy following the March Madness results, but I never have a good plan for how to fill out a bracket. When I saw a link to Coder's Bracket, I thought it was an awesome idea: instead of making lots of arbitrary decisions, write a computer program to pick the bracket for you. But what do I know about any of the basketball stats that it uses as input?
What I know about is text analytics, and so Luminoso's text analytics are going to pick a bracket for me. It may not win a billion dollars from Warren Buffett, but it'll be fun.
Picking winners based on Twitter activity
I told our Twitter listener to find March Madness-related terms such as "NCAA", "basketball", "bracket", and of course "March Madness". I also gave it the full names of all the teams according to Wikipedia, in case those helped. I collected tweets from about noon to 5 PM today.
My assumption is that the teams that are more likely to win are the ones that more people are talking about on Twitter. If the Florida Gators end up matched against the Tulsa Golden Hurricane, and we can see that more people are talking about the Gators than the Golden Hurricane, we pick the Gators as the winners.
The only problem is how to measure this. The full names of college basketball teams are often complicated and unlikely to be used in casual conversation, such as the "Stephen F. Austin Lumberjacks" or the "Louisiana-Lafayette Ragin' Cajuns". But shorter versions of the names are ambiguous. If someone says "Iowa", are they talking about the Iowa Hawkeyes or the Iowa State Cyclones? If you say "Go Wildcats", which of four teams are you cheering for? And are people who tweet #GoState even trying to communicate anything?
First, I should check my assumption. Can I just count the number of occurrences of the full name of the team? Not really. This fails for a few reasons:
- The results are noisy. The numbers are small (about 5 to 100) and they change drastically when you use a slight variant of the team name.
- The results may be biased, in that some teams may be more likely to be called by their full name than others.
- Most of the accounts that write out full team names are spammers auto-generating tweets. Unless I believe that spammers have some key insight on who's going to win, I should probably disregard this data.
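To make the naive approach concrete, here is a minimal Python sketch of counting full team names in tweets. The function name and the sample tweets are made up for illustration; this is not the listener code that collected the data.

```python
def count_full_names(tweets, team_names):
    """Count the tweets that contain each team's full name (case-insensitive)."""
    counts = {name: 0 for name in team_names}
    for tweet in tweets:
        text = tweet.lower()
        for name in team_names:
            if name.lower() in text:
                counts[name] += 1
    return counts


# Hypothetical sample tweets, not real collected data.
sample_tweets = [
    "Florida Gators all the way in my bracket",
    "go gators!",
    "Gators vs Golden Hurricane tonight",
]
counts = count_full_names(sample_tweets, ["Florida Gators", "Tulsa Golden Hurricane"])
print(counts)
```

All three sample tweets are about the Gators, but only one of them is counted, and the tweet that mentions the Golden Hurricane doesn't count for Tulsa at all. That is the noise problem in miniature: the count depends almost entirely on which variant of the name you search for.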
On the one hand, this sounds like a job for Luminoso and semantic similarity.
On the other hand, basically all of the semantic information that would link teams to nicknames is missing here. People will put "Gonzaga" next to other teams in a list of teams they're rooting for more often than they'll put it next to "Bulldogs" to form the whole team name, so there is no accurate way to determine automatically that "Gonzaga" and "Bulldogs" are the same team.
But Luminoso's term statistics can tell us which terms seem to be interesting, and many of those are team names.
Assigning relevance to teams
The relevance function in Luminoso finds words and 2-3 word phrases that appear more than you'd expect from the general distribution of words in English. This isn't just word frequency: phrases can be more relevant than their individual words if those words usually appear together, and it de-emphasizes common words like "next" (not to mention "the") in favor of ones that are more interesting in this set of data.
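The general idea of "appears more than you'd expect" can be sketched in a few lines of Python. This is not Luminoso's relevance function, whose details aren't described here. It is a stand-in that scores single words only (no phrases) by comparing their frequency in the tweets to an assumed background frequency in English. The `background_freq` table, the `default_freq` fallback, and the scoring formula are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def relevance_scores(tweets, background_freq, default_freq=1e-6):
    """Score each word by count * log(observed frequency / expected frequency).

    background_freq maps a word to its assumed frequency in general English.
    Words missing from the table are treated as rare (default_freq).
    """
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    total = sum(counts.values())

    scores = {}
    for term, count in counts.items():
        observed = count / total
        expected = background_freq.get(term, default_freq)
        scores[term] = count * math.log(observed / expected)
    return scores


# Hypothetical inputs: the tweets and background frequencies are invented.
tweets = ["the gators will win the bracket", "go gators", "the next game"]
background = {"the": 0.05, "next": 0.001, "go": 0.001, "will": 0.005,
              "win": 0.0005, "game": 0.0005}
scores = relevance_scores(tweets, background)
for term in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(term, round(scores[term], 2))
```

Even in this toy version, "gators" outranks "the", although "the" appears more often, because "the" is expected to be everywhere and "gators" is not.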
We have an ordered list of relevant terms that we extracted from the collected tweets. It contains things like "tournament bracket, ncaatournament, Iowa, Gators, Duke, MSU, ...". The only problem, as described before, is assigning these relevant terms to teams.
We know now that we can't hope for a clear one-to-one mapping (some of these aren't teams anyway). But we can go through the list of terms in order, and ask: "Does this term appear in exactly one team name?"
Each team then ends up with the relevance score of its most relevant unique term. In the list above, for example, we'd start by skipping "tournament bracket" and "ncaatournament" because they don't go with a team. We then skip "Iowa" because it appears in the names of more than one team. We keep "Gators", and we'd assign its score to the Florida Gators if they don't already have a higher score.
I had to make sure to allow for abbreviations, such as "UConn" for the University of Connecticut Huskies and "NC State" for the North Carolina State Wolfpack. But once this was done, it seemed to be a fairly resilient way to rank the teams by their Twitter buzz. It was probably a bit unfair to the Kentucky Wildcats: both of their names are ambiguous, so they could only count for the full phrase "Kentucky Wildcats". But that just goes to show that they should pick a more creative name.
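The matching procedure can be sketched roughly as follows, in Python. The function names, the alias table, the short team list, and the scores are all made up for illustration, and the real code may differ in how it matches terms against team names.

```python
def contains_phrase(name, phrase):
    """True if the words of `phrase` appear consecutively among the words of `name`."""
    name_tokens = name.lower().split()
    phrase_tokens = phrase.lower().split()
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(name_tokens[i:i + n] == phrase_tokens
               for i in range(len(name_tokens) - n + 1))


def assign_scores(ranked_terms, team_names, aliases=None):
    """Give each team the best score among terms that match exactly one team.

    ranked_terms: list of (term, score) pairs, most relevant first.
    aliases: optional dict from a lowercase abbreviation to a full team name.
    """
    aliases = aliases or {}
    best = {}
    for term, score in ranked_terms:
        key = term.lower()
        if key in aliases:
            matches = [aliases[key]]
        else:
            matches = [name for name in team_names if contains_phrase(name, key)]
        if len(matches) != 1:
            continue  # not a team term, or ambiguous between teams
        team = matches[0]
        if team not in best or score > best[team][1]:
            best[team] = (term, score)
    return best


# Hypothetical data: a partial team list and invented relevance scores.
teams = ["Florida Gators", "Iowa Hawkeyes", "Iowa State Cyclones",
         "Kentucky Wildcats", "Eastern Kentucky Colonels", "Arizona Wildcats",
         "Connecticut Huskies", "Duke Blue Devils"]
terms = [("tournament bracket", 9.0), ("ncaatournament", 8.5), ("Iowa", 8.0),
         ("Gators", 7.0), ("Duke", 6.5), ("UConn", 6.0), ("Wildcats", 5.0),
         ("Kentucky Wildcats", 2.0)]
result = assign_scores(terms, teams, aliases={"uconn": "Connecticut Huskies"})
for team, (term, score) in result.items():
    print(team, term, score)
```

With this toy input, "Iowa" and "Wildcats" are skipped as ambiguous, "UConn" is resolved through the alias table, and the Kentucky Wildcats only get credit for the low-scoring full phrase.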
The results, including which term was the best for each team, are shown in this spreadsheet.
There's a tweak in the code that says that 15th and 16th seeds can't win. I added that, not just because it's a very good assumption, but also because otherwise we'd give too much weight to a team such as Cal Poly that was just about to play a play-in game as the data was being collected.
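Picking a single game then comes down to comparing two buzz scores, with the seed rule applied first. Here is a minimal Python sketch. The function name, the tie-breaking rule (better seed wins), and the example numbers are assumptions, not the actual bracket code.

```python
def pick_winner(team_a, team_b, buzz, seeds, blocked_seeds=(15, 16)):
    """Pick the team with more Twitter buzz, except that 15th and 16th seeds
    can't beat a better-seeded team.

    buzz: dict from team name to relevance score (missing teams count as 0).
    seeds: dict from team name to seed number (1 is best).
    """
    a_blocked = seeds[team_a] in blocked_seeds
    b_blocked = seeds[team_b] in blocked_seeds
    # If exactly one team is a blocked seed, the other team wins outright.
    if a_blocked != b_blocked:
        return team_b if a_blocked else team_a

    score_a = buzz.get(team_a, 0.0)
    score_b = buzz.get(team_b, 0.0)
    if score_a != score_b:
        return team_a if score_a > score_b else team_b
    # Assumed tie-break: the better (lower-numbered) seed wins.
    return team_a if seeds[team_a] <= seeds[team_b] else team_b


# Hypothetical scores and seeds, for illustration only.
buzz = {"Cal Poly Mustangs": 9.0, "Wichita State Shockers": 4.0,
        "Florida Gators": 7.0, "Tulsa Golden Hurricane": 1.5}
seeds = {"Cal Poly Mustangs": 16, "Wichita State Shockers": 1,
         "Florida Gators": 1, "Tulsa Golden Hurricane": 13}
print(pick_winner("Cal Poly Mustangs", "Wichita State Shockers", buzz, seeds))
print(pick_winner("Florida Gators", "Tulsa Golden Hurricane", buzz, seeds))
```

In the first game the 16th seed has more buzz but still loses. In the second, neither team is blocked, so the higher buzz score decides it.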
Finally, a concept cloud
One of these posts wouldn't seem complete without showing the semantic space of all these mostly-March Madness tweets as a concept cloud. The words and phrases here are sized by their relevance score, colored by major topics (the tournament in general, and teams that people are rooting for), and positioned so that related concepts are near each other.
Disregard Putin; he's getting into everything these days.