Notes from a Jupyter/Dataframe project

I’m looking at some League of Legends data; champion win rates by player tier. I’m curious which champions have higher win rates with skilled players, and/or which are more popular. You can see the resulting report here, notebook (without nice CSS) is here.

Learning me some Jupyter and Pandas

Really I’m using this project to get more comfortable with using Jupyter and Pandas DataFrames to analyze data. The DataFrames part is overkill; I’ve got about 10k of data, not the 10G of numbers DataFrames’ highly optimized code is for. But DataFrames have lots of functions that replicate SQL’s analysis capabilities and they display nicely in Jupyter, so here I am.

My data starts as web pages of HTML tables. I imported those into a single Google Spreadsheet with multiple tabs.For a single sheet, rows are champions (120 or so) and columns are statistics like win rate or games played (5 of them). There’s 7 sheets, one per tier Bronze / Silver / Gold / … I then exported those sheets as CSV files, one per tier, for analysis. (Google Spreadsheets turns out to be a remarkably good HTML table data scraper.)

I then played with various forms of Python code for reading those CSV files and doing some statistics on them. Starting with simple raw Python data structures, a bunch of dicts and tuples all the way up to a full Panda Panel. Along the way I learned how to analyze the data using Pandas a bit, and to format the results nicely in Jupyter. My real goal here is to make interactive reporting as natural as if I’d loaded all the data into a SQL database. Being able to do natural things like sum all values, or get the average of numbers grouped by some criterion.

Data loading

One thing I found is that the more I whipped my data into Pandas shape, the simpler my code got. I finally settled on doing all the parsing way up front, when I read the CSV files, to make clean DataFrame objects with well named columns and rows. That makes the later analysis code much simpler.

The read_csv() function has a lot of options to configure parsing, it’s worth the time to set them. “thousands=,” for instance will allow it to parse things like “1,523,239” as an integer and not a string. I also ended up doing “header=0” and specifying my own “names=[…]” for columns, just so I could make the names valid Python identifiers. (You can access named things in a DataFrame with attribute syntax; either df[“winrate”] or df.winrate.)

But the most important thing for read_csv was index_col. The concept of index in a DataFrame is particularly important. That’s the unique key that names each row; in my data the index is the champion name. In a CSV file this would probably be the column named “label” or the like. In a database table it’d be the primary key. The word index is used all over the docs and it confuses me all the time. Once I set up my data index correctly the code got a lot better.

The thing I find most confusing is the shape of data. DataFrames, for instance, mostly seem to want to treat a column of numbers as a single feature. But I really need to work row-rise most of the time. Slicing a DataFrame is particularly bizarre. df[n] gives you the nth column from the dataframe df. But df[n:n+2] gives you rows n to n+2. I find I often have to transpose my dataframe to make operations convenient, or use the “axis” parameter to functions like sum() to indicate whether I want to sum over columns or over rows.

The final fillip was using a Pandas Panel to store my data. A Panel is a collection of DataFrames; in my case one per tier. This 3 dimensional data structure is a bit confusing, the docs apologize for the names and it’s not clear that Panel adds a lot of value over a dict of DataFrame objects. But it’s Pandas’ way of representing a collection of spreadsheets, so I am going with it. My panels “items” are the tier names, the major_axis is champion rows (named by the index) and the minor_axis is the stats (the columns.)

Data analysis

So now I have my data loaded into a panel, time to analyze. The pattern I’ve settled on is generating a new DataFrame that’s a report table. For instance I’m interested in average statistics per tier. So I start with a 3d Panel of tiers x champions x stats, and I boil that down to a 2d DataFrame of tiers x average(stats). When I do ad-hoc Python report generation I usually just iteratively print out report lines, or maybe make an array of dicts for my data. Those are still fine too, but I wanted to go full Pandas.

I started out by writing lots of laborious loops over panel.iteritems() and the like. See cell 7 and how I first build up a dict named ‘spreads’ and then converted it to a DataFrame. Over time I’ve refined to simply doing operations like dividing a DataFrame by a Series and the like. Cell 8 shows that style, the creation of pick_rates is about the same computation as spreads before, but it’s less code. The whole process feels a lot like vectorizing code, like I did for my machine learning class in Octave/MatLab. The code gets shorter, which is nice, but it should also be more efficient which matters for large datasets.

Presentation

My final goal was to make nice-looking reports right in Jupyter. That proves to be quite difficult. Pandas is amazing at inlining graphs and visualizations with matplotlib and friends. But it’s not quite as good at textual reports. I mean Jupyter will display any HTML you make in python, but I haven’t found any good HTML generation convenience libraries out there for, say, DataFrames.

Fortunately the default DataFrame presentation HTML is pretty good. Labelled columns and rows, decent formatting of numbers. All that’s really missing is right alignment of numbers (and why is that not the default?!) But I want more control, for instance rounding floating point numbers to specific precision. For that I used the DataFrame.style object, which gives you a D3-esque ability to apply per-datum CSS styles to cells. I mostly used .format() to control text formatting of numbers and .set_table_styles() to add some CSS to the HTML table. The .applymap() function is the really powerful one that lets you control things per-datum.

I also threw in the background_gradient style, to give a sort of heatmap coloring to the number cells. That required installing a bunch of crap like python-tk and numpy just to get the colormaps. But it’s pretty painless in the end.

Conclusion

Pandas + Jupyter is a pretty good platform for ad hoc exploration of data, and with some learning and effort you can produce pretty clean code and reports. I love the way notebooks are reproducible and shareable. I think using SQL for this kind of work is probably more effective, but there’s a lot of overhead in getting set up with a database. And the interactive exploration isn’t as nice as a Jupyter notebook. (Hmmm, there’s no reason Jupyter cells couldn’t contain SQL code!)

3 thoughts on “Notes from a Jupyter/Dataframe project

  1. After all the Panda-ing, did you get some sense of where it might be better to use Pandas vs munging things in explicit Python or dumping everything in a SQL db? I keep meaning to try it while hoping to minimize any actual work.

  2. The real improvement is Jupyter, the workflow of cells and displaying HTML and graphics inline with your code. That works great with Python. In theory that should work with SQL too, but the SQL/Jupyter integration I’ve found is early days.

    I’d also say Pandas is better than “explicit Python”, if by that you mean “ad hoc collection of tuples and dicts”. If your data fits into a DataFrame, ie: a 2d matrix of numbers, then Pandas is great. You get nicely labelled data, lots of analysis and slicing functions, all very simple. (And performant, if you have 100,000+ data rows).

    I’m less sure Pandas is better than a SQL db. For the simple data matrix case where you’d otherwise have a single database table imported from a CSV file, I’d say it’s about the same. But of course a SQL database can handle relational data across multiple tables. Pandas has some relational-like things, a sort of join and a pivot table, but I haven’t explored them.

  3. Thanks, that’s really helpful. I’m starting to fiddle with it now and you’re right, there sure are a lot of knobs. For my uses, the Jupyter stuff is a pleasant nice-to-have; the thing I’m starting to notice with Python is that the old adage about basic parsing and data ingestion/processing performance not mattering much as i/o dominates is no longer true. I/O has become dramatically faster while Python is as slow as it’s always been. I just ran across a case where parsing out a first field in a space separated record with the unreadable field = line[:line.find(‘ ‘)] vs the obvious field = line.split()[0] improved overall performance by 30%. Chasing that sort of micro-optimization rather defeats the purpose of using a high-level language. Pandas looks like a potential fix both by virtue of implementation and density of existing SO answers.

Comments are closed.