I just finished a short project I was working on, an analysis of data from League of Legends about how people are using the new rune system the game just introduced. The report is here and speaks for itself; writing here to talk about how I wrote this up. For context the project consisted of scraping about 250 web pages off a site, parsing them, stuffing data into a database, then analyzing it for patterns.
I decided to do this project as a purely Jupyter notebook project. In the past I’ve used notebooks frequently for data exploration but seldom as my full workflow. In particular it’s not clear notebooks are well suited for production scripts like “scrape this website” but I decided to do it anyway, with some success.
- I wish there were a way to hide cells. I want to keep some code around but I don’t want it to show in whatever report I’m publishing.
- I hid the code entirely so civilians don’t get overwhelmed. I did this with a “toggle code” button borrowed from here.
- I wish notebooks had explicit named checkins. I find the commit cycle of git is good discipline for getting me to do only one thing at a time. There’s a Checkpoint feature in notebooks which is similar but it doesn’t prompt you for a comment so I never use it.
- It seems worth noting that my virtualenv for the project was 550MB. Pandas is big but I wanted it. Somehow I ended up with all of scipy even though I just wanted a few colormaps from Seaborn.
I wrote the scraper with BeautifulSoup. As noted before there’s a nice addon for Jupyter that gives notebook previews of scraped data. Interactively working towards robust scrapers was really nice in a notebook. I ended up creating a main loop to scrape all the pages I needed and write them to a simple shelf database. The full scrape only took a few minutes so I just ran it interactively; this wouldn’t work nearly as well for a run that takes hours or is perpetual. One clever thing I did was create a new shelf database for each run, so I could keep old data easily.
Data processor and database creator
My second notebook loaded the scraped data from a shelf and processed it down to a more query-friendly SQLite database. Like shelf files I kept one timestamped database per run so I could easily go back to old data. Interactive development for this part was nice, particularly being able to do some very simple sanity check reports along the way.
I used the dataset module to create the database for me. It’s quite nice; create a list of dicts and it just makes a schema from them and writes the data, no fuss.
Ad hoc queries
For my primary data exploration I used ipython-sql to run SQL code directly as Notebook cells, no Python requires. This was my favorite use of notebooks and no surprise, it’s exactly the kind of work Jupyter is for. Write query, see data as HTML table, done. ipython-sql allows you to mix in Python code, you can capture the results of queries as result sets and do things with them. I started trying to get fancy with that and realized that wasn’t working very well; better to stick to mostly SQL. Also the presentation options are very limited, once I started thinking about sharing this data with other people I wanted to give it more polish.
For my final report I made a new notebook using Pandas DataFrames with sql queries developed in the ad hoc notebook. Mostly used Pandas as a better table presenter; it makes it easy to round off numbers to 2 digits or color data cells based on their contents. Also I ended up using Pandas Python code to manipulate my data after it came from the database, convolutions that would be awkward to do with SQL queries. This all worked fairly well but in retrospect part of me thinks I should have gone with styling JSON blobs using D3 instead. That’s more work though.
The result of this sheet is the report itself and while it’s a fine Notebook it’s not a very nice deliverable for my target audience. LoL players are used to infographics like this, not boring tables. I’m uninterested in doing that kind of presentation though, so I stopped there.
Update: I posted this to Reddit and got on the front page, as much attention as I expect for a project like this. OTOH most of the discussion doesn’t seem informed by the data I presented. Good ol’ Internet people. Lesson for me in the value of presenting simple conclusions, not data.