Undiscovered Country

Resources and tools used for the Codemotion Berlin'13 panel:
"Undiscovered Country: How To Get Started Exploring Big Data".

About the Panel/Repo

Big Data is a Big Topic, but if you don’t have a PhD or if you’ve never worked at a supercomputer, how do you get started? During our panel at Codemotion Berlin three ladies each presented a rapid 7-minute intro outlining a practical approach to big data for beginners. We discussed data visualization, statistical analysis, collaborative filtering and open datasets. Our panelists gave demos and did some live coding to dispel some of the mystery of data-crunching voodoo. This repo contains all the code and instructions you'll need to get started exploring.

Where to Find Datasets

Big Data is everywhere, but sometimes it's hard to get your hands on it. Here are just a few places to find open datasets you can download and start working with right away. If you're hungry for more, check out this epic Quora question, which has lots more resources.

• Amazon AWS Public Datasets
• Infochimps: Big Variety, Not All Open
• University of California, Irvine: 239 Datasets for Machine Learning
• Demographic, Political and Economic Data from the EU
• Health and Social Indicators in Berlin (German)

About the Panelists

Irina Ioana Brudaru, Technical Account Manager at Google Munich

MSc in CS at Max Planck Institute in Germany. Research background, love of Algorithms/Data. 3+ years in Berlin startup world. [email, LinkedIn]

Kate McCurdy, Babbel

Kate started out in language and ended up in numbers. An academic psycholinguist turned visualization aficionado, she keeps a foot in both worlds as a data wrangler for the online language-learning platform Babbel. [website, k [({at})] k-means ])}dot{([ net]

Monika Moser, simia.tech

Monika is a software engineer who loves working on challenges in distributed systems. She works as a freelance developer and consultant in the simia.tech team, focusing on backend solutions. After studying computer science she did research on distributed key-value stores and became an expert in NoSQL systems. She's addicted to sports like Beach Volleyball, Running, Yoga and Bootcamp. [website, @momo13]

About the Moderators

Amélie Anglade, Music Information Retrieval Software Engineer at SoundCloud

Amélie is a developer at SoundCloud. By day she researches, prototypes and implements search engines, audio recommender systems and other Music Information Retrieval algorithms. By night she volunteers at the OpenTechSchool and the Berlin Geekettes trying to get more women into programming and tech. [@utstikkar, about.me]

Johanna Brewer, Co-Founder of frestyl

Live music nerd. Developer, designer, ethnographer & long-suffering vegetarian. Doctor (the PhD kind) of Information & Computer Sciences. Previously designed location-based interfaces @ Intel and algorithms for scientific research @ the Swiss National Super Computer and Massachusetts General Hospital. [@deadroxy, about.me]

Demo Walkthrough: DataViz with D3.js

Presented by: Kate McCurdy

1. Download the dataset

http://infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada
We're looking at reports of UFO sightings because, naturally, an undiscovered country is going to have lots of unidentified objects, and we might as well get a head start on dealing with such things, no?

2. Make an HTML page, the vessel for our visualization

In the header - load the libraries.


    <!-- source libraries -->
    <script type="text/javascript" src="http://d3js.org/d3.v3.js"></script>
    <script type="text/javascript" src="https://raw.github.com/square/crossfilter/master/crossfilter.min.js"></script>
    <script type="text/javascript" src="https://raw.github.com/NickQiZhu/dc.js/master/dc.min.js"></script>

    <!-- our script -->
    <script type="text/javascript" src="CODEMOTION.UFO.VIS.js"></script>

    <!-- dc.js stylesheet -->
    <link rel="stylesheet" type="text/css" href="dc.css"/>

We'll be using Mike Bostock's fantastic library D3.js, which is flexible and powerful but operates at a fairly low level (i.e. targeting individual SVG elements) - so even though there are all sorts of wonderful tutorials out there to ease the process, it can take some time to get a basic chart up and running.

To speed things up, we'll use the higher-level dc.js, which provides a nice wrapper to quickly build interactive D3 charts. dc.js in turn depends on crossfilter.js, another Bostock library that provides tools for slicing and dicing data across multiple dimensions.

The JavaScript libraries can be hosted remotely, but for styling you'll want to save a local copy of dc.css.

In the body - copy and paste the anchor divs from the dc.js homepage, with some minor modifications:


    <h1>UFO Sightings</h1>

    <div id="date-chart">                                                                   
    <p><span>Date of sighting</span>                                                                    

    <a class="reset" href="javascript: dc.filterAll();dc.redrawAll();" style="display: none;">reset</a>  

    <span class="reset" style="display: none;">Current filter: <span class="filter"></span></span></p>
    </div>

    <div id="shape-chart">                                                                     
    <p><span>Shape of UFO</span>                                                                     

    <a class="reset" href="javascript:dc.filterAll();dc.redrawAll();" style="display: none;">reset</a>       

    <span class="reset" style="display: none;">Current filter: <span class="filter"></span></span></p>
    </div>

Here's a commented example.

For anyone with absolutely zero JavaScript experience (we've all been there): to view the HTML page in the browser with its JavaScript and data loading properly, you'll generally need to have a local server running. If you're using a Unix system (this includes Macs), this can be accomplished with one command in your console:


    $ python -m SimpleHTTPServer    # Python 2; with Python 3 use: python3 -m http.server

Now open your browser to http://localhost:8000 (8000 being the default port), and you should be able to navigate directly to the file.

3. Open up the script

The commands to load the data are already set up.


    d3.text('ufo_awesome.tsv', function(error, data) {

    var ufo_data = d3.tsv.parseRows(data, function(d) {
        return {
        sighted_at: new Date(d[0].substring(0,4), d[0].substring(4,6) - 1, d[0].substring(6,8)), // Date months are zero-based
        shape: d[3]
        };
    });

    // More functions coming here

    });

Here we're loading the data with d3.text. The first argument is the path to the data file; the second is a callback function that executes once the file finishes loading. Once it loads, we parse the rows with d3.tsv.parseRows to get the info we need: date of sighting (converted to a JavaScript Date) and reported shape of the UFO. (Normally we'd just load and parse the data in one step with a call to d3.tsv, but that assumes a header row that our dataset lacks, so we need to load and then parse.)
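
To make the parsing step less magical, here is a minimal plain-JavaScript sketch of the same idea without D3. The helper names (parseCompactDate, parseSightings) and the two sample rows are made up for illustration, and the column order is an assumption based on the indexes used in the script (date in column 0, shape in column 3). One detail worth keeping in mind: the Date constructor counts months from 0, so a month read from a YYYYMMDD string has to be shifted by one.

```javascript
// Convert a compact "YYYYMMDD" string into a Date.
// The Date constructor counts months from 0 (January) to 11 (December).
function parseCompactDate(s) {
  var year = Number(s.substring(0, 4));
  var month = Number(s.substring(4, 6)) - 1; // shift 1-12 to 0-11
  var day = Number(s.substring(6, 8));
  return new Date(year, month, day);
}

// Split header-less TSV text into rows of fields, then keep two columns.
function parseSightings(text) {
  return text
    .split("\n")
    .filter(function (line) { return line.length > 0; }) // skip blank lines
    .map(function (line) {
      var fields = line.split("\t");
      return { sighted_at: parseCompactDate(fields[0]), shape: fields[3] };
    });
}

// Two made-up rows in the assumed column order (illustrative, not real data).
var sample =
  "19951009\t19951009\tSomewhere\tdisk\t2 min\tMade-up description\n" +
  "20000101\t20000102\tElsewhere\tlight\t5 sec\tAnother made-up row\n";

var rows = parseSightings(sample);
console.log(rows[0].sighted_at.getMonth()); // 9, i.e. October
console.log(rows[1].shape);                 // light
```

In the real script d3.tsv.parseRows does the splitting for us; the sketch only shows what "load, then parse" amounts to.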

Everything added after this point stays inside the d3.text callback function. You may have noticed the helpful comment I left there.

4. Apply a set of crossfilter commands

We take the parsed data and set up a structure so that the data can be grouped easily.


    var cx = crossfilter(ufo_data);

    var date_dim = cx.dimension(function(d) { return d.sighted_at; });
    var shape_dim = cx.dimension(function(d) { return d.shape; });

    var count_by_date = date_dim.group().reduceCount();
    var count_by_shape = shape_dim.group().reduceCount();

The first command makes a crossfilter object. The next two set up the dimensions of interest, here date and shape. The final two commands tell crossfilter how to aggregate ('group') along these dimensions. In this case we're just counting each record; if we had a numeric variable (say, Number Of Aliens On Board), we could have called reduceSum to aggregate by summing instead, or passed our own functions to reduce for a custom aggregation.
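
If grouping feels abstract, the following plain-JavaScript sketch shows roughly what the two reductions report. It is a conceptual stand-in, not how crossfilter works internally (crossfilter keeps sorted indexes and updates groups incrementally). The helpers countBy and sumBy, the sample records and the aliens field are all hypothetical.

```javascript
// Count records per key, roughly what dimension.group().reduceCount() reports.
function countBy(records, keyOf) {
  var counts = {};
  records.forEach(function (r) {
    var k = keyOf(r);
    counts[k] = (counts[k] || 0) + 1;
  });
  return counts;
}

// Sum a numeric field per key, roughly what reduceSum(valueOf) reports.
function sumBy(records, keyOf, valueOf) {
  var sums = {};
  records.forEach(function (r) {
    var k = keyOf(r);
    sums[k] = (sums[k] || 0) + valueOf(r);
  });
  return sums;
}

// Made-up records; `aliens` is the hypothetical numeric field.
var sightings = [
  { shape: "disk", aliens: 3 },
  { shape: "fireball", aliens: 0 },
  { shape: "disk", aliens: 2 }
];

var byShape = function (d) { return d.shape; };
console.log(countBy(sightings, byShape)); // { disk: 2, fireball: 1 }
console.log(sumBy(sightings, byShape, function (d) { return d.aliens; })); // { disk: 5, fireball: 0 }
```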

5. Add charts

We'll create a bar chart showing number of UFO sightings by date from 1985 to 2011 (when the dataset was last updated) and a pie chart broken down by shape.


    dc.barChart('#date-chart')
    .dimension(date_dim)
    .group(count_by_date)
    .x(d3.time.scale().domain([new Date(1985, 0, 1), new Date(2011,0, 1) ]));

    dc.pieChart('#shape-chart')
    .dimension(shape_dim)
    .group(count_by_shape);

    dc.renderAll();

The call to dc.renderAll draws the charts. Check it out in the browser!

6. Make things pretty

One thing you may have noticed is that the bar chart was much too small. We can fix that, and add some nifty options like titles on hover and y-axis resizing when we filter from the pie chart:


    dc.barChart('#date-chart')
    .dimension(date_dim)
    .group(count_by_date)
    .x(d3.time.scale().domain([new Date(1985, 0, 1), new Date(2011,0, 1) ]))
    .width(1000)      // set chart width
    .brushOn(false)   // take away interactivity (necessary for titles)
    .title(function(d) { 
        return d3.time.format("%b %d, %Y")(d.key) + // put date in pretty format
        ": " + d.value + " sightings"; })
    .renderTitle(true)         
    .elasticY(true);    // auto-resize y-axis on filtering
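
As a rough picture of what happens when you click a pie slice: the shape filter narrows the records, the per-date counts are recomputed, and an elastic y-axis rescales to the new maximum. The sketch below mimics that in plain JavaScript with made-up records and hypothetical helpers (countsByDate, maxCount); it illustrates the idea and is not dc.js's implementation.

```javascript
// Made-up records; date strings stand in for the parsed Date objects.
var records = [
  { date: "2001-01-01", shape: "disk" },
  { date: "2001-01-01", shape: "disk" },
  { date: "2001-01-01", shape: "fireball" },
  { date: "2001-01-02", shape: "fireball" },
  { date: "2001-01-02", shape: "fireball" }
];

// Count sightings per date.
function countsByDate(recs) {
  var counts = {};
  recs.forEach(function (r) {
    counts[r.date] = (counts[r.date] || 0) + 1;
  });
  return counts;
}

// Largest bar height: the value an elastic y-axis would scale to.
function maxCount(counts) {
  return Object.keys(counts).reduce(function (m, k) {
    return Math.max(m, counts[k]);
  }, 0);
}

// Unfiltered view, then the view after "clicking" the fireball slice.
var all = countsByDate(records);
var fireballs = countsByDate(records.filter(function (r) {
  return r.shape === "fireball";
}));

console.log(maxCount(all), maxCount(fireballs)); // 3 2
```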

7. Admire the finished product (or the version with comments)

Woohoo! Have fun investigating the time course for different UFO shapes by clicking and filtering on the pie chart. I especially invite you to click on the category 'fireball' and tell me what you think was happening on the day of the big spike in late 1999. Goodness gracious!

Demo Walkthrough: R for Statistical Analysis

Presented by: Irina Ioana Brudaru

Coming soon.

Demo Walkthrough: Redis for Recommendation

Presented by: Monika Moser

Redis is a NoSQL database that is well suited to heavy read and write loads, especially when small data losses are acceptable. Data is kept in memory, but redis also offers several persistence options. This makes redis useful for, say, real-time updates of counters for popular Twitter hashtags.

This demo shows how redis can be used to implement collaborative filtering using a sparse matrix. The underlying dataset is an RSS feed of bookmarks from delicious.com. Redis stores the combinations of tags that were added to bookmarks. Based on the data stored in redis, recommendations for related tags can be made. The demo shows a simplified scenario and demonstrates the power of sorted sets in redis.

1. Download and compile redis


    > wget http://redis.googlecode.com/files/redis-2.6.13.tar.gz
    > tar xzf redis-2.6.13.tar.gz
    > cd redis-2.6.13
    > make

2. Install the redis Ruby gem


    > gem install redis

3. Start the redis server


    > redis-server # will listen on port 6379

4. Start the redis shell


    > redis-cli

5. Try a few simple redis commands


    redis> SET title "redis for recommendation"
    redis> GET title
    redis> DEL title

    redis> HMSET talk title "redis for recommendation" location "Berlin"
    redis> HSET talk duration 7
    redis> HGETALL talk
    redis> HINCRBY talk duration 1
    redis> HGETALL talk

6. Download the dataset

http://www.infochimps.com/datasets/delicious-bookmarks-september-2009

7. Load dataset into redis (this will take a while, go get a coffee)


    > ruby load_delicious_data.rb data/delicious-rss # (~1250000 lines)

The tag combinations form a sparse matrix, which can be represented with sorted sets. For each tag we store a sorted set of the tags that were combined with it. The score of each member reflects how often that combination occurred. (No optimizations here, just the simplest possible approach.)
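
The loader script itself isn't reproduced in this walkthrough, so the following is only a sketch of the bookkeeping it might do, written in plain JavaScript with an in-memory object standing in for redis. The key names ("delicious:all" and "delicious:<tag>") follow the queries used in the later steps; the assumption that "delicious:all" counts how often each tag was used, the helper names and the sample tags are illustrative. The zincrby helper mimics the redis ZINCRBY command, which adds an increment to a member's score in a sorted set.

```javascript
// In-memory stand-in for redis sorted sets: key -> (member -> score).
var store = {};

// Mimics ZINCRBY key increment member and returns the new score.
function zincrby(key, increment, member) {
  var set = store[key] || (store[key] = {});
  set[member] = (set[member] || 0) + increment;
  return set[member];
}

// Record one bookmark's tags: bump each tag's overall count and
// every ordered pair of distinct tags that appear together.
function addBookmark(tags) {
  tags.forEach(function (tag) {
    zincrby("delicious:all", 1, tag);
    tags.forEach(function (other) {
      if (other !== tag) zincrby("delicious:" + tag, 1, other);
    });
  });
}

// Two made-up bookmarks.
addBookmark(["photography", "camera", "art"]);
addBookmark(["photography", "camera"]);

console.log(store["delicious:photography"]); // { camera: 2, art: 1 }
console.log(store["delicious:all"]);         // { photography: 2, camera: 2, art: 1 }
```

Only pairs that actually occur get stored, which is what keeps the matrix sparse: a tag that was never combined with "photography" simply has no entry in that set.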

8. Check number of keys and memory usage


    redis> INFO

9. Find 10 most popular tags


    redis> ZREVRANGE "delicious:all" 0 9 WITHSCORES

10. Find all tags that have been combined at least 1,000 times


    redis> ZREVRANGEBYSCORE "delicious:all" +inf 1000 WITHSCORES

11. List all tags that have been combined with "photography" sorted by popularity


    redis> ZREVRANGE "delicious:photography" 0 -1
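
To see what these queries return, here is a simplified plain-JavaScript imitation of ZREVRANGE and ZREVRANGEBYSCORE over a single sorted set. The scores are made up, the helper functions are illustrative, and the sketch skips details of the real commands: negative start indexes are not handled, and members with equal scores are not ordered the way redis orders them.

```javascript
// A sorted set as a plain object of member -> score (made-up numbers).
var related = { camera: 42, art: 17, flickr: 30, travel: 5 };

// Members by descending score, like ZREVRANGE key start stop.
// A negative stop counts from the end, so (0, -1) means "everything".
function zrevrange(set, start, stop) {
  var members = Object.keys(set).sort(function (a, b) {
    return set[b] - set[a];
  });
  var end = stop < 0 ? members.length + stop + 1 : stop + 1;
  return members.slice(start, end);
}

// Members with min <= score <= max, highest first,
// like ZREVRANGEBYSCORE key max min (note that max comes first).
function zrevrangebyscore(set, max, min) {
  return zrevrange(set, 0, -1).filter(function (m) {
    return set[m] >= min && set[m] <= max;
  });
}

console.log(zrevrange(related, 0, 1));                // [ 'camera', 'flickr' ]
console.log(zrevrange(related, 0, -1));               // [ 'camera', 'flickr', 'art', 'travel' ]
console.log(zrevrangebyscore(related, Infinity, 17)); // [ 'camera', 'flickr', 'art' ]
```

A recommendation for "photography" is then just the first few members of its set: the tags most often used together with it.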