I finally sat down and learned some IPython for data exploration today. I re-examined some data I had for Battlefield 4 players, something I’d published as my BF4Plots visualization a year or two ago. It’s the stats on ~5000 top Battlefield 4 players, things like their win/loss ratio, skill score, etc.
You can see my notebook in this gist. That’s a static snapshot taken from my last run. If you run it yourself the results will change slightly as some of the algorithms I apply include randomization.
My main goal here was to get more comfortable using IPython. It’s awesome! I really love having the inline, easy to produce graphs. I mostly followed this tutorial for guidance.
Secondarily, I was trying to apply some cluster analysis to my BF4 data to see if I could glean any insights. Not really, as it turns out. This was my first foray into a machine learning project of my own, applying stuff I’ve learned in my course. It was interesting, and I definitely felt I understood what the scikit algorithms were doing. But I was also a little lost on how to get real meaning out of the data. No surprise, that’s the hard part!
The main IPython thing I’m still adjusting to is that it’s a notebook, not a program. I’m so used to iteratively working and re-running my program from scratch. But IPython encourages you to keep a persistent Python VM around and iteratively add to its state. I keep looking for the “Start over and run it all from scratch” button, and there sort of is one, but that’s not really how you’re supposed to work. Which got me into trouble a couple of times. Also I do wonder how IPython people transition their code to running programs for other people to reuse.
Update: a “restart and run all” button was just merged into the main codebase so that’s coming soon. And apparently I’m right that notebooks are more for exploration. When it comes time to create reusable code you create a normal module with some other code editing environment. You can certainly start by pasting your notebook-developed code over, though.
First Python library I learned was Pandas, the data management library. Really all I did was work a bit with DataFrame, which is a data container type that’s a 2d array on steroids. Columns are typed, everything’s nicely addressable and searchable, and it’s all efficient numeric sparse arrays behind the scenes. I think it even supports multicore number crunching in a transparent fashion. Really nice library, a joy to use.
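To give a flavor of what that looks like, here’s a tiny sketch of the DataFrame features I mean. The column names are made up for illustration, not the actual BF4Plots schema:

```python
import pandas as pd

# A toy DataFrame mimicking player stats (columns are illustrative,
# not the real BF4 data)
df = pd.DataFrame({
    "player": ["a", "b", "c", "d"],
    "win_loss": [1.2, 0.8, 2.5, 1.0],
    "skill": [350, 290, 480, 310],
})

# Columns are individually typed
print(df.dtypes)

# Boolean-mask filtering: players with above-average skill
good = df[df["skill"] > df["skill"].mean()]
print(good)

# Quick summary statistics for every numeric column
print(df.describe())
```

That boolean-mask indexing is the “nicely addressable and searchable” part: you write the condition once and get back a fully typed sub-frame.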
Second library I learned was matplotlib, some basic bashing at it to draw some graphs. It’s really great and I wish I’d invested the time to learn it before. But until I saw the HTML notebooks it wasn’t very compelling to me. It’s funny, Mathematica notebooks have been a thing for 20+ years but it’s only in the last couple of years that Python caught up.
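A minimal example of the kind of basic graphing I mean, using synthetic win/loss data since the real dataset isn’t inlined here (in a notebook you’d use `%matplotlib inline` and skip the `Agg` backend and `savefig`):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for ~5000 players' win/loss ratios
rng = np.random.default_rng(0)
ratios = rng.lognormal(mean=0.0, sigma=0.5, size=5000)

fig, ax = plt.subplots()
ax.hist(ratios, bins=50)
ax.set_xlabel("win/loss ratio")
ax.set_ylabel("players")
ax.set_title("Distribution of win/loss ratios (synthetic data)")
fig.savefig("ratios.png")
```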
Final thing I learned was a bit of scikit-learn, in particular how to apply its PCA and K-Means algorithms. Went pretty smoothly. My only annoyance is that it seems to only half-support Pandas. It will take DataFrames as inputs, but the objects it returns are plain NumPy arrays, which lack column names and some of the nice slicing options. I can’t tell if I’m doing it wrong, if it has to be this way, or if Pandas is just new enough that scikit-learn hasn’t fully adopted it yet.
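Here’s a sketch of the pattern, again on synthetic data with made-up column names. It shows the half-support: PCA happily eats a DataFrame but hands back a bare array, so you re-wrap it yourself to get the column names back:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for the player stats (columns are illustrative)
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["win_loss", "skill", "kdr", "spm"])

# PCA accepts a DataFrame as input...
pca = PCA(n_components=2)
components = pca.fit_transform(df)
print(type(components))  # ...but returns a plain NumPy ndarray

# ...so wrap it back into a DataFrame to keep named columns
reduced = pd.DataFrame(components, columns=["pc1", "pc2"], index=df.index)

# K-Means cluster labels also come back as a plain NumPy array
km = KMeans(n_clusters=3, n_init=10, random_state=0)
reduced["cluster"] = km.fit_predict(reduced[["pc1", "pc2"]])
print(reduced.head())
```

Fixing the `random_state` pins down the randomization mentioned earlier; leave it out and the clustering (and the notebook’s results) shift slightly from run to run.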
Anyway, all in all my experiment was a big success, very happy. Fun stuff!