Data Science London Hackathon
James Billot | COOLGARIF TECH
Data Science London is a wonderful community of very clever folk and we've been going along to their meetups whenever we get the chance. So when news came of their 24hr Data Science Hackathon we were very enthusiastic about getting involved.
We only won it...
They asked us to write about our experiences in the hackathon and answer some interview questions:
How we developed our solution
Prior to the event, Richie and I had discussed a few ideas and approaches so that we could hit the ground running once it kicked off. We wanted to play around with the idea of 'influence'. It's a fairly nebulous term, so we asked specifically: what could these 300 accounts do with their influence should they choose to wield it?
On the day we needed to find another couple of teammates and luckily Nicola and Alessio came wandering past with a handmade sign reading 'Node.js and D3'. After a few minutes of geeky technology chatter and introductions, it was a done deal and team Londinium.JS was born. Half the team are originally from Rome, and we're all based in London (it made sense at the time).
It quickly became apparent that the data presented didn't really fit the network/graph-based visualisation we had been considering. Although there was some data outlining links between individuals, none of those links corresponded with individuals within the 300. That meant we had to move away from our preferred visualisation ideas and look more towards presenting the available information in an easily digestible format.
At this point we decided to 'noodle around' with some ideas individually for an hour, and then to double down on whatever we considered to be the best idea. When the hour was up, we had written a tidy little Python script to download and parse the Twitter images, the start of a tiled wall of Twitter profiles with overlays exposing more information on mouse-over, some spider-graphs built in D3.js of an individual's influence on particular topics, and some ideas for a graph portraying the change in an individual's influence over time. We decided on a mash-up of all three visualisation ideas as they all told part of the same story.
Nothing in the data required that there be a live-feed or API to provide the data to the visualisation, so one of our pairs took on the task of converting the data into an aggregated/static form that could easily be embedded directly on the page. The other pair got to work mining the data for additional insights beyond that which was originally provided.
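As a rough illustration of what an 'aggregated/static form' might look like, here is a minimal Python sketch. The field names (`handle`, `topic`, `score`), the sample rows, and the `INFLUENCERS` variable name are all hypothetical; this is not the script we wrote on the day.

```python
import json
from collections import defaultdict


def aggregate_scores(rows):
    """Collapse per-topic rows into one record per account.

    `rows` is an iterable of dicts with the hypothetical keys
    'handle', 'topic' and 'score'.
    """
    by_handle = defaultdict(dict)
    for row in rows:
        by_handle[row["handle"]][row["topic"]] = row["score"]
    # Sort by handle so the generated output is stable between runs.
    return [
        {"handle": handle, "topics": topics}
        for handle, topics in sorted(by_handle.items())
    ]


def to_embedded_js(records, var_name="INFLUENCERS"):
    """Render the records as a JS assignment a page can include directly."""
    return "var {} = {};".format(var_name, json.dumps(records, sort_keys=True))


# Made-up sample data, purely for illustration.
sample_rows = [
    {"handle": "alice", "topic": "sport", "score": 0.8},
    {"handle": "alice", "topic": "music", "score": 0.3},
    {"handle": "bob", "topic": "sport", "score": 0.5},
]
static_records = aggregate_scores(sample_rows)
embedded_js = to_embedded_js(static_records)
```

Writing `embedded_js` to a `.js` file and including it on the page means the visualisation needs no API or database at runtime.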
Our visualisation needed an 'angle' and we settled upon the idea of the 300 wielding their influence to incite revolution. To further decorate the data provided, we wanted to generate some 'what-if' scenarios: given an individual's influence, authority, and number of followers, how many people could they incite to revolt, and what would that curve look like over time?
Following on from this we made the assumption that each set of followers of the 300 would start off undecided as to whether to revolt or not. Each follower would then be convinced to be pro-revolution, or decide to be anti-revolution, at some point in the future, based upon the number of tweets that they received and the relative authority of their 'influencer'.
We generated the cumulative data in this manner for all 300 individuals, scaled down by 33% to allow for overlap in Twitter followers (an arbitrary number selected by the team in the absence of any data on this metric), and used it to create the main graph at the top of the visualisation. We also generated this for each individual in turn, with the intention of adding these graphs to the individuals' overlays, but we ran out of time to integrate this.
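The 'what-if' model described above can be sketched in Python roughly as follows. The specifics are assumptions made for illustration, not the formula we used on the day: each tweet independently makes an undecided follower decide with a fixed probability (`per_tweet_decide_prob`, a hypothetical parameter), a follower who decides goes pro-revolution with probability equal to the influencer's authority (taken as a value between 0 and 1), and 'scaled down by 33%' is read as multiplying the total by 0.67.

```python
OVERLAP_DISCOUNT = 0.33  # the team's arbitrary allowance for shared followers


def pro_revolution_curve(followers, tweets_per_day, authority, days,
                         per_tweet_decide_prob=0.01):
    """Expected number of pro-revolution followers at the end of each day.

    Assumed model: every follower starts undecided; each tweet makes an
    undecided follower decide with probability `per_tweet_decide_prob`;
    a deciding follower goes pro-revolution with probability `authority`.
    """
    undecided = float(followers)
    pro = 0.0
    curve = []
    # Chance that an undecided follower makes up their mind on a given day.
    daily_decide = 1.0 - (1.0 - per_tweet_decide_prob) ** tweets_per_day
    for _ in range(days):
        deciding = undecided * daily_decide
        pro += deciding * authority
        undecided -= deciding
        curve.append(pro)
    return curve


def combined_curve(accounts, days):
    """Sum the per-account curves, then discount for follower overlap."""
    totals = [0.0] * days
    for account in accounts:
        curve = pro_revolution_curve(days=days, **account)
        totals = [t + c for t, c in zip(totals, curve)]
    return [t * (1.0 - OVERLAP_DISCOUNT) for t in totals]


# Two made-up accounts, purely for illustration.
accounts = [
    {"followers": 10000, "tweets_per_day": 5, "authority": 0.6},
    {"followers": 2000, "tweets_per_day": 20, "authority": 0.3},
]
curve = combined_curve(accounts, days=30)
```

Under these assumptions the curve rises quickly at first and then flattens as the pool of undecided followers shrinks.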
What was your background prior to entering the Data Challenge?
James Billot and Richie Barter are co-founders of Coolgarif Tech, a London-based digital agency focused on all things data-related. Alessio Fabiani has previously been a Research Engineer at the University of Illinois, Chicago, in the field of Semantic Web, Machine Learning, and Big Data. Nicola Greco was a Wired 'Top 10 Italian Under 21' and is the founder of BrunoApp, a seed-backed startup offering cloud-computing social network analysis, which won an award from the European Commission and was presented at TEDx.
What made you decide to enter this Data Challenge?
James and Richie: We were keen to meet other people with a similar interest in this area, learn from them, stretch ourselves, and have some fun.
Alessio and Nicola: We were looking forward to catching up with each other and we can think of no better venue than a Big Data Hackathon!
What data preprocessing, machine learning, or data visualisation methods did you use?
Initially, we wanted to get a feel for the 'shape' of the data, which we accomplished by pushing the data into a MongoDB instance on Azure, and then writing NodeJS scripts against it to do some exploration: determining spreads of values, interconnection (if any) between individuals within the dataset, etc.
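The kind of exploration described above can be sketched in a few lines. Note that this is plain in-memory Python rather than the NodeJS-against-MongoDB scripts we actually ran, and the field names and sample records are hypothetical.

```python
from statistics import median


def spread(values):
    """Return (min, median, max) for a non-empty list of numbers."""
    ordered = sorted(values)
    return ordered[0], median(ordered), ordered[-1]


def internal_links(edges, handles):
    """Keep only the links whose two endpoints are both in `handles`."""
    known = set(handles)
    return [(src, dst) for src, dst in edges if src in known and dst in known]


# Made-up records standing in for the accounts in the dataset.
accounts = [
    {"handle": "alice", "followers": 1200},
    {"handle": "bob", "followers": 450000},
    {"handle": "carol", "followers": 9800},
]
# Made-up links; none has both ends inside the account list.
edges = [("alice", "dave"), ("erin", "bob"), ("frank", "grace")]

follower_spread = spread([a["followers"] for a in accounts])
links_inside = internal_links(edges, [a["handle"] for a in accounts])
```

An empty `links_inside` is the situation we found in the real data: links existed, but none joined two individuals within the 300.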
For the visualisation itself we used a combination of simple line-graphs, spider-graphs, and some custom JS to tile everyone's profile images which we pulled from Twitter.
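For readers unfamiliar with spider-graphs, the geometry is simple: one axis per topic, spaced evenly around a circle, with each score setting the distance from the centre. The sketch below shows that calculation in Python; our charts were drawn with D3.js, and the function name, scores, and default radius here are illustrative assumptions.

```python
import math


def radar_vertices(scores, max_score=1.0, radius=100.0):
    """Return (x, y) points for a spider-graph polygon, one per topic.

    Axes are spaced evenly around the circle, starting at 12 o'clock and
    running clockwise in screen coordinates (y grows downwards, as in SVG).
    """
    n = len(scores)
    points = []
    for i, score in enumerate(scores):
        angle = 2.0 * math.pi * i / n
        r = radius * (score / max_score)
        points.append((r * math.sin(angle), -r * math.cos(angle)))
    return points


topic_scores = [1.0, 0.5, 0.0, 0.5]  # hypothetical scores for four topics
polygon = radar_vertices(topic_scores)
```

Joining the points in order, and closing the shape back to the first point, gives the filled polygon for one individual.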
What was your most important insight you got from the dataset?
How worryingly prevalent footballers and people 'famous' for public displays of stupidity are in the list.
Were you surprised by any of your insights?
Our analysis asked: if these 300 influential people put their heads together, could they really incite rebellion? How strong is their individual and collective influence in real terms? Fun as this analysis was in the context of the event, team Londinium (soberly) conclude that the fair citizens need not fear a Twitter-led revolution just yet!
Which tools and programming languages did you use?
Back-end for data 'noodling':
- MongoDB on a Windows Azure VM instance
- NodeJS on a Windows Azure VM instance running Ubuntu 12.10 with the following libraries installed via npm:
- Python script to download and parse the Twitter icons
- Windows Azure free website instance for serving the visualisation
Front-end for display of the visualisation:
- HTML5 and CSS3 (compiled from LESS)
- D3.js for charts
- A bunch of custom JS for various elements of the visualisation
Dev tools used:
- git for version control
- Sublime Text 2 - editor
- WebStorm - JS editor
- OS X and Ubuntu 12.10
What have you taken away from the Big Data Hackathon?
New knowledge, new skills, new connections, new friends, adrenaline, and no small amount of sleep-deprivation.
What did you think of the 24 hour hackathon format?
We really enjoyed it (although if you had asked us at 4am on Sunday morning we might have given a slightly less enthusiastic answer)! It's long enough that you can accomplish some meaningful results, but short enough to keep you motivated: the finish line is always within sight.
We spent every single second available on the task - deploying our last line of code in the final minute of the competition.
What do you think about the Data Science London community?
The community is great. Everyone (participants, hosts, sponsors, organisers) is extremely friendly, approachable, enthusiastic, and happy to help out. Even at stupid-o'clock in the morning there were lots of delirious smiles everywhere.