League of Legends stats: some findings

I’ve been noodling with some League of Legends data as an excuse to learn more data analysis tech. I’ve blogged about the software; here are the actual results I found.

League of Legends is a huge game, with some 100M players a month playing 1M+ games a day. And the data for those games is readily available. I took a look at third party summaries of game statistics from champion.gg and na.op.gg and tried to divine some meaning. The dataset is mostly for ranked games by Platinum+ skilled players in North America for patch 6.17. But I also looked at players of different skill levels and trends across game patches.

The most interesting conclusion I found was that some champions win much more in the hands of skilled players. Nidalee, for instance, has a 45% win rate in Silver but a 54% win rate in Diamond. That’s not exactly a surprise, she’s known to have a high skill cap, but it’s good to see hard data. And there are surprises; I would not have guessed Pantheon also was highly improved or that Dr. Mundo suffered at higher skill levels.

Despite looking, mostly I didn’t find any unusual trends or big shifts in the game over patches. To me that’s a testament to the quality of the game, in particular Riot’s game balance team.

Below find some analysis and graphs. I’ve put my notebook output online for viewing here:

Basic stats for patch 6.17

The first thing I did was look at the basic champion.gg stats for patch 6.17. Each row in my dataset is one summary line for all players of a champion in a role. For instance Nami / Support. Note there’s a bit of bias here; champion.gg omits rare cases like Nami / Jungle.

There’s relatively little diversity in ADC and Support champions. Simple result: there are only 17 regularly played ADC champions and 25 Supports. There are ~50 champions each played Top/Middle/Jungle. This fact skews a lot of my later statistics, particularly with the ADC pool, where a single champion (Lucian) dominates with a 32% play percentage. The top mid, by contrast, is Ahri with only 13%.

[Figure: distribution of play percentage across champions, by role]

Players generally don’t pick champions with higher win rates. This surprised me a bit; I thought there’d be a clear bias towards the OP champs. There isn’t.

[Figure: win rate vs. play percentage scatter]

Players do tend to win more on champions they have more experience with. No big surprise here, but more games = more wins. The stat “experience” here is average games played on that champion. (It’s in the champion.gg dataset, but not displayed on the website.) The correlation coefficient is +0.24 with a p value of 0.0005, which seems likely to be significant to me. The causation is less clear. For example Galio-Top players have an average of 107 games on the champ and a 55% win rate. Do people win on Galio because they play him a lot, or do they play him a lot because they win? (Or in the specific case of Galio; do relatively few people play him but the ones who do are really good?)

[Figure: win rate vs. average games played scatter]
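For the curious, a correlation like this is a one-liner with SciPy. A minimal sketch, assuming the champion summary lives in a DataFrame with hypothetical experience and winPercent columns:

from scipy.stats import pearsonr
# hypothetical columns: average games played and win rate, one row per champion/role
r, p = pearsonr(df['experience'], df['winPercent'])
# in my data this comes out around r=+0.24, p=0.0005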

Stats for players of different skills from na.op.gg

I was curious how these statistics varied by tier, so I went to na.op.gg and downloaded stats for players at all skill levels. I mostly focused on Silver through Diamond, since those are the most “normal” player populations. This data covers Aug 15 – Sep 15 2016 and spans patches 6.17 and 6.18.

Some champions have much higher win rates at higher tiers. This result is the most interesting I’ve found. I feel like each champion at the extremes has a story. I’m not surprised that Nidalee’s win rate goes up 9.37% from Silver to Diamond; she’s hard to play well and so presumably does better in the hands of a skilled player. Aurelion Sol goes up 5.8%, maybe because he depends on skilled teammates? But why does Pantheon go up 7.5%? He seems so simple! I’m guessing Aatrox and Garen do worse in higher tiers because players know how to exploit their weaknesses. And Amumu’s 4.1% drop may be as simple as Diamond players being less vulnerable to Amumu cheese. But there’s more to say here and I’d like to develop this data further. Here are the 10 most improved champions and the 10 who suffer the most.

[Figure: 10 most improved and 10 most diminished win rates, Silver to Diamond]

Some champs are significantly more popular at higher tiers. The companion result to win rates: what about pick rates by tier? Janna gains the most, going from a 0.8% pick rate to 2.9%. (As a frequent Janna player, my theory is she is much more rewarding to play when your team is more skilled.) Leona drops from 1.6% to 0.6%. The ADC champion pool attenuates at higher tiers too; Lucian dominates and Jinx drops off, for instance.

[Figure: biggest pick rate changes, Silver to Diamond]

Silver-to-Diamond win rate change is correlated with Silver-to-Diamond pick rate change. In the previous section I reported that for Platinum players, win rate is not correlated with pick rate. However, the champs whose win rates improve most from Silver to Diamond also tend to gain the most pick rate from Silver to Diamond. This suggests that higher tier players are picking champions that have more impact at their skill level. Well, maybe; I have a suspicion this effect is pretty small.

[Figure: win rate change vs. pick rate change, Silver to Diamond]


Trends across patches

Finally I was curious how stable the game statistics were over patches. I collected historical data from champion.gg’s GitHub going back to patch 5.10, over a year ago. These are all ranked Platinum games.

Stats are pretty stable between patches. In general there’s not a lot of change in the game since 5.10. For instance, players averaged a 2.55 KDA in 5.10 and a 2.48 KDA now in 6.17. There is a general trend up in some of the cumulative stats like “goldEarned” or “totalDamageTaken”. That could easily be a symptom of games just taking longer over time; unfortunately I don’t have stats on game length to normalize with. Two highlights though. Experience has a big drop at the beginning of Season 6. I take this to be new accounts and/or people changing up how they play. Also totalHeal had a big bump up around patch 5.22. That could be the same general cumulative stat effect, but I think it may also indicate the addition of more healing abilities and the rise of healing supports like Soraka and Nami.

[Figure: average player stats per patch, 5.10 through 6.17]

Champion diversity is pretty stable between patches. Up above I showed the distribution of playPercent across 5 roles for patch 6.17; here’s an animation of that distribution across patches. It’s pretty stable. So are the distributions for all the metrics I’m most interested in: goldEarned, KDA, winPercent. In general the distribution across champs looks the same between patches. I find that a bit surprising because I think of LoL game balance as being pretty dynamic, with a constantly changing meta. But it’s not really reflected in the stats.


Better Jupyter charts: animation

Well this is neat: matplotlib supports generating animations, and you can inline HTML5 video right in Jupyter. The only wrinkle I hit is the stupid Ubuntu ffmpeg/avconv kerfuffle. Trying to produce the animation by default produced the error "KeyError: 'ffmpeg'"; I solved it by setting

matplotlib.rcParams['animation.writer'] = 'avconv'

There’s a GitHub comment saying that avconv may not be reliable, but it seems to be working for me in Chrome.
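Here’s the basic pattern as a minimal sketch, with stand-in data (my real animation is the playPercent distribution across patches):

import matplotlib
matplotlib.rcParams['animation.writer'] = 'avconv'  # the Ubuntu workaround above
import matplotlib.pyplot as plt
from matplotlib import animation
from IPython.display import HTML

fig, ax = plt.subplots()
bars = ax.bar(range(5), range(1, 6))

def update(frame):
    # redraw the bars for one "patch"; stand-in data here
    for i, bar in enumerate(bars):
        bar.set_height((i + frame) % 5 + 1)

anim = animation.FuncAnimation(fig, update, frames=12, interval=300)
HTML(anim.to_html5_video())  # inlines an HTML5 <video> in the notebook output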

Now to figure out how to make this work with Seaborn. Some hints here.

Update: the Seaborn hints mostly just worked. Hilariously, the inline video solution is to encode the whole MP4 file in a base64 blob right in the HTML, some 130k worth in my case. Hey, it works.

Better Jupyter charts: Seaborn

Took a quick tour through Seaborn, the enhancement library for matplotlib. It’s very good! It does two basic things. It makes the default matplotlib charts prettier, and it gives you an easy API to do some fancier types of statistical visualization. I’d say it’s a no-brainer to use Seaborn if you’re doing exploratory data visualizations. It still just renders raster images in Jupyter, so it doesn’t fulfill my goal of having nice SVG interactive charts in the browser, but Seaborn is solving a different problem.

Prettier

“Drawing attractive figures is important”, the docs say, and I couldn’t agree more. Seaborn reconfigures matplotlib so the default charts look better. I don’t just mean nice anti-aliasing, but also reasonable grid ticks and color choices. Seaborn has good perceptual palettes, which are really important. I believe stock matplotlib has recently improved in part with input from Seaborn.

The other part of “attractive figures” is the Seaborn API is DataFrame-aware and will label your plots using the labels in your DataFrame. Getting nicely labelled axes and titles and stuff takes several lines of manual code with matplotlib; with Seaborn it’s a single line of code.
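For instance, one line like this (the DataFrame df and its columns are hypothetical here) gets you a labeled, styled chart:

import seaborn as sns
# axis labels and tick labels come straight from the DataFrame columns
sns.barplot(x='role', y='winRate', data=df)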

Fancier statistics

The meatier part of Seaborn is it has more complex chart types built in.

Distributions are mostly what I’ve used; there’s a quick sketch after this list. In detail:

  • distplot: 1 dimensional distributions, an enhancement of matplotlib.hist. It adds a continuous kernel density estimate to the bars, and also has a rug-plot option.
  • jointplot: 2 dimensional distributions, an enhancement of matplotlib.scatter. Adds a correlation coefficient and histograms on the side, a sort of quick ggplot. Can also do continuous contour plots. One drawback is there doesn’t seem to be any easy way to set the hue of the individual dots, so no sneaking in a third dimension of data. Pass in kind="reg" to have it try to fit a regression curve to the scatter.
  • hexbins: 2 dimensional histograms
  • pairplot: pairwise jointplots for your N dimensional dataset. So nice to have this in a single command, although it’s unwieldy for N > 5 or so. (Note this one does have a way to set the hue of individual dots.)
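The promised sketch, again with a hypothetical champion DataFrame:

import seaborn as sns
# 1d: histogram plus a kernel density estimate, with a rug plot
sns.distplot(df['winRate'], rug=True)
# 2d: scatter with marginal histograms and a fitted regression
sns.jointplot(x='playRate', y='winRate', data=df, kind='reg')
# all pairwise jointplots at once; hue sneaks in a categorical dimension
sns.pairplot(df, hue='role')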

Regressions are for fitting various kinds of statistical models to your dataset.

Categorical graphs are for looking at statistics across qualitative categories, with plots like swarm plots and violin plots. They’re really quite beautiful. Here’s a sample visualization of the distribution of a variable (y axis) for each of 5 categories (the x axis). I’ve used a violin plot, which shows a KDE continuous approximation, along with a swarm plot to show the actual dots on top of it; a sketch of the recipe follows the figure. (The data set is # of games players have for each of 5 roles in League of Legends. That one crazy outlier is Fiddlesticks Jungle; players tend to have 220+ games on him!)

[Figure: violin plot with overlaid swarm plot of games played per role]
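That figure is just two Seaborn calls on the same axes; a minimal sketch, assuming a hypothetical games_df with role and games columns:

import seaborn as sns
ax = sns.violinplot(x='role', y='games', data=games_df, inner=None)
sns.swarmplot(x='role', y='games', data=games_df, color='black', size=3, ax=ax)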

Finally, data-aware grids are used for making small multiples plots of the same dataset. I already mentioned pairplot, for doing NxN visualizations of N variables, pairwise. FacetGrid is a tool for quickly generating a bunch of graphs comparing across multiple categorical variables.
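The FacetGrid pattern is construct-then-map: build the grid from categorical columns, then map a plotting function across it. A sketch with hypothetical tier and role columns:

import seaborn as sns
import matplotlib.pyplot as plt
# one histogram of win rates for every (tier, role) combination
g = sns.FacetGrid(df, row='tier', col='role')
g.map(plt.hist, 'winRate')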

Conclusion

I like Seaborn. I like that it looks good out of the box. I also like that it allows me to make more sophisticated graphs very simply, with little effort. It may be a little too easy; I’m not sure a violin plot is really the right treatment of that data above, for instance. But it sure looks good!

Better Jupyter charts: mpld3

As much as I’m liking Jupyter, I don’t much like that matplotlib is the default charting library. I mean, matplotlib gets the job done, and it’s powerful and popular. But at the end of the day you’re rendering raster images with 10+ year old software. (Even if Seaborn has made the raster images better.) We’ve got a fancy browser viewer; why not have fancy SVG/Javascript graphs?

That’s what mpld3 provides. Using it is as simple as importing it and calling mpld3.enable_notebook(). All (most?) of your old matplotlib API code will still work. mpld3 emulates that charting API but renders to SVG using D3, instead of rendering to a static image. It’s kinda nutty that it just works. The resulting pixels look good, although admittedly matplotlib’s raster backend (agg?) looks pretty good too. 

The real advantage is the graphs are manipulable with Javascript. Out of the box every graph is now pannable / zoomable, right in the browser window. You can do fancy linked brushes across multiple figures.

There are also plugins to add more capability. The PointLabelTooltip plugin is particularly nice: it adds an easy way to have HTML tooltips on a scatterplot when you hover over the points. There’s also MousePosition (useful for images).
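Putting it together, a minimal sketch (the scatter data and labels are stand-ins):

import matplotlib.pyplot as plt
import mpld3
mpld3.enable_notebook()  # matplotlib output now renders as D3/SVG

fig, ax = plt.subplots()
points = ax.scatter([1, 2, 3], [4, 5, 6])
labels = ['Ahri', 'Janna', 'Lucian']  # stand-in hover labels
mpld3.plugins.connect(fig, mpld3.plugins.PointLabelTooltip(points, labels=labels))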

And… and that’s about it. After a bunch of work a couple of years ago development seems to have stalled. That’s OK, it’s useful as is. Particularly since it’s such a simple drop-in upgrade for matplotlib.

Probably time to look at alternatives. Bokeh is the “better matplotlib” and includes a bunch of browser / notebook stuff. However it’s still rendering raster images (albeit in an HTML5 canvas). The examples sure are pretty though. I also see frequent references to VisPy, which is GL based, but although there are some noises made about WebGL and notebooks I can’t find a working demo. Other options I’ve run across are pygal, bqplot, and Altair. People also talk about Plotly but it’s a hosted system and costs money.

Some other thoughts on this topic: The future of visualization in Python, In defense of Matplotlib, Overview of Python Visualization Tools

While I’m here, ipywidgets is kind of nutty. It lets you add interactive controls to a notebook like sliders and input boxes, to turn your notebook into an interactive app.
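The core trick is the interact decorator: it inspects your function’s arguments and builds matching controls. A sketch:

from ipywidgets import interact

# a list becomes a dropdown, a tuple becomes a slider
@interact(tier=['Silver', 'Gold', 'Platinum', 'Diamond'], min_games=(0, 200))
def report(tier, min_games):
    print('would re-render the report for', tier, min_games)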


Pandas DataFrames: MultiIndex and slicing

Diving more in to using Pandas DataFrames, I spent some time learning about MultiIndex. Long story short it’s a way to have a composite key for your data, to say “these two columns of my CSV file are the name for the row”. Or more complex things.

Pandas has some fairly powerful mechanisms to subset your DataFrame based on aspects of its MultiIndex composite key. They’re a bit confusing though; the slicing syntax is abstruse. Also there’s a hidden gotcha: you really have to sort your DataFrame before you can slice it if it has a MultiIndex.

Anyway, there’s a demo notebook program here. I also uploaded the ipynb file as a gist but GitHub’s viewer is buggy.

For the search engines: if you see any of the following error messages, this sample notebook will help you figure them out. Summary: call DataFrame.sortlevel().


PerformanceWarning: indexing past lexsort depth may impact performance.

KeyError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'

PS: if you want to get the series of index labels for a DataFrame with a MultiIndex, use get_level_values(). I keep forgetting how to do this.
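Putting the whole section together; a sketch, with a hypothetical CSV keyed by (champion, role):

import pandas as pd

df = pd.read_csv('champions.csv', index_col=['champion', 'role'])
df = df.sortlevel()  # sort_index() in newer Pandas; do this before slicing
df.loc['Nami']                         # all roles for one champion
df.loc[('Nami', 'Support')]            # one exact row by composite key
idx = pd.IndexSlice
df.loc[idx[:, 'Support'], :]           # all Support rows, any champion
df.index.get_level_values('champion')  # the index labels from the PS above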

How to win Stack Exchange: come early

I’m in the top 6% of Stack Exchange users, despite seldom contributing to the site. How? I got in early.

I didn’t do anything clever. I just happened to have asked a well-worded question seven years ago that’s been popular with Google searchers ever since. It helps that I got a great, surprisingly simple answer.

It just keeps paying off, a reliable karma trickle. Something like half of my reputation points come from the one question.

I think the Stack sites are terrific, btw. I now go to Stack Exchange first for questions on new libraries like Pandas, before I even go to the code’s own documentation.

Notes from a Jupyter/Dataframe project

I’m looking at some League of Legends data; champion win rates by player tier. I’m curious which champions have higher win rates with skilled players, and/or which are more popular. You can see the resulting report here, notebook (without nice CSS) is here.

Learning me some Jupyter and Pandas

Really I’m using this project to get more comfortable with using Jupyter and Pandas DataFrames to analyze data. The DataFrames part is overkill; I’ve got about 10k of data, not the 10G of numbers DataFrames’ highly optimized code is for. But DataFrames have lots of functions that replicate SQL’s analysis capabilities and they display nicely in Jupyter, so here I am.

My data starts as web pages of HTML tables. I imported those into a single Google Spreadsheet with multiple tabs. For a single sheet, rows are champions (120 or so) and columns are statistics like win rate or games played (5 of them). There are 7 sheets, one per tier: Bronze / Silver / Gold / … I then exported those sheets as CSV files, one per tier, for analysis. (Google Spreadsheets turns out to be a remarkably good HTML table data scraper.)

I then played with various forms of Python code for reading those CSV files and doing some statistics on them, starting with simple raw Python data structures (a bunch of dicts and tuples) all the way up to a full Pandas Panel. Along the way I learned how to analyze the data using Pandas a bit, and to format the results nicely in Jupyter. My real goal here is to make interactive reporting as natural as if I’d loaded all the data into a SQL database: being able to do natural things like sum all values, or get the average of numbers grouped by some criterion.
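For the record, those two SQL idioms in Pandas, sketched against a hypothetical long-format DataFrame:

total_games = df['games'].sum()                      # SELECT SUM(games) FROM stats
avg_winrate = df.groupby('tier')['winrate'].mean()   # ... AVG(winrate) GROUP BY tier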

Data loading

One thing I found is that the more I whipped my data into Pandas shape, the simpler my code got. I finally settled on doing all the parsing way up front, when I read the CSV files, to make clean DataFrame objects with well named columns and rows. That makes the later analysis code much simpler.

The read_csv() function has a lot of options to configure parsing; it’s worth the time to set them. thousands=',' for instance will allow it to parse things like "1,523,239" as an integer and not a string. I also ended up doing header=0 and specifying my own names=[…] for columns, just so I could make the names valid Python identifiers. (You can access named things in a DataFrame with attribute syntax: either df["winrate"] or df.winrate.)

But the most important thing for read_csv was index_col. The concept of index in a DataFrame is particularly important. That’s the unique key that names each row; in my data the index is the champion name. In a CSV file this would probably be the column named “label” or the like. In a database table it’d be the primary key. The word index is used all over the docs and it confuses me all the time. Once I set up my data index correctly the code got a lot better.
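My loader ended up looking roughly like this (file and column names are hypothetical):

import pandas as pd

df = pd.read_csv(
    'silver.csv',          # one CSV per tier
    thousands=',',         # parse "1,523,239" as an integer
    header=0,              # throw away the scraped header row...
    names=['champion', 'winrate', 'playrate', 'games', 'kda'],  # ...use these instead
    index_col='champion')  # the champion name uniquely keys each row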

The thing I find most confusing is the shape of data. DataFrames, for instance, mostly seem to want to treat a column of numbers as a single feature. But I really need to work row-wise most of the time. Slicing a DataFrame is particularly bizarre: df[n] gives you the column named n from the dataframe df, but df[n:n+2] gives you rows n to n+2. I find I often have to transpose my dataframe to make operations convenient, or use the "axis" parameter to functions like sum() to indicate whether I want to sum over columns or over rows.
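Concretely, with rows as champions and columns as stats:

col_totals = df.sum(axis=0)  # one number per stat (summing down each column)
row_totals = df.sum(axis=1)  # one number per champion (summing across each row)
flipped = df.T               # or just transpose so the awkward axis becomes easy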

The final fillip was using a Pandas Panel to store my data. A Panel is a collection of DataFrames, in my case one per tier. This 3 dimensional data structure is a bit confusing; the docs apologize for the names, and it’s not clear that Panel adds a lot of value over a dict of DataFrame objects. But it’s Pandas’ way of representing a collection of spreadsheets, so I am going with it. My Panel’s items are the tier names, the major_axis is champion rows (named by the index), and the minor_axis is the stats (the columns).
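Construction is just a dict of DataFrames; a sketch with hypothetical per-tier frames, using the Pandas of this era:

import pandas as pd

# items = tiers, major_axis = champion rows, minor_axis = stat columns
panel = pd.Panel({'Silver': silver_df, 'Gold': gold_df, 'Diamond': diamond_df})
panel['Gold'].loc['Nami']  # one champion's stats in one tier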

Data analysis

So now I have my data loaded into a Panel; time to analyze. The pattern I’ve settled on is generating a new DataFrame that’s a report table. For instance I’m interested in average statistics per tier. So I start with a 3d Panel of tiers x champions x stats, and I boil that down to a 2d DataFrame of tiers x average(stats). When I do ad-hoc Python report generation I usually just iteratively print out report lines, or maybe make an array of dicts for my data. Those are still fine too, but I wanted to go full Pandas.

I started out by writing lots of laborious loops over panel.iteritems() and the like. See cell 7, where I first build up a dict named 'spreads' and then convert it to a DataFrame. Over time I’ve refined to simply doing operations like dividing a DataFrame by a Series. Cell 8 shows that style: the creation of pick_rates is about the same computation as spreads before, but it’s less code. The whole process feels a lot like vectorizing code, like I did for my machine learning class in Octave/MatLab. The code gets shorter, which is nice, but it should also be more efficient, which matters for large datasets.
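The flavor of that refactor, sketched with hypothetical names (my real spreads/pick_rates computations differ in detail):

import pandas as pd

# cell-7 style: loop over the Panel, build a dict, convert at the end
rates = {}
for tier, frame in panel.iteritems():
    rates[tier] = frame['games'] / frame['games'].sum()
loop_pick_rates = pd.DataFrame(rates)

# cell-8 style: one vectorized DataFrame / Series operation
games = panel.minor_xs('games')   # champions x tiers DataFrame
pick_rates = games / games.sum()  # divide each tier column by its total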

Presentation

My final goal was to make nice-looking reports right in Jupyter. That proved to be quite difficult. Pandas is amazing at inlining graphs and visualizations with matplotlib and friends. But it’s not quite as good at textual reports. I mean, Jupyter will display any HTML you make in Python, but I haven’t found any good HTML generation convenience libraries out there for, say, DataFrames.

Fortunately the default DataFrame presentation HTML is pretty good. Labelled columns and rows, decent formatting of numbers. All that’s really missing is right alignment of numbers (and why is that not the default?!). But I want more control, for instance rounding floating point numbers to specific precision. For that I used the DataFrame.style object, which gives you a D3-esque ability to apply per-datum CSS styles to cells. I mostly used .format() to control text formatting of numbers and .set_table_styles() to add some CSS to the HTML table. The .applymap() function is the really powerful one; it lets you control things per-datum.

I also threw in the background_gradient style, to give a sort of heatmap coloring to the number cells. That required installing a bunch of crap like python-tk and numpy just to get the colormaps. But it’s pretty painless in the end.
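Chained together, the styling code looks roughly like this (column names hypothetical):

styled = (df.style
    .format({'winrate': '{:.1%}', 'games': '{:,.0f}'})  # per-column number formats
    .set_table_styles([{'selector': 'td',
                        'props': [('text-align', 'right')]}])
    .background_gradient(cmap='YlGn', subset=['winrate'])
    .applymap(lambda v: 'font-weight: bold' if v > 0.55 else '',
              subset=['winrate']))
styled  # Jupyter renders the styled HTML table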

Conclusion

Pandas + Jupyter is a pretty good platform for ad hoc exploration of data, and with some learning and effort you can produce pretty clean code and reports. I love the way notebooks are reproducible and shareable. I think using SQL for this kind of work is probably more effective, but there’s a lot of overhead in getting set up with a database. And the interactive exploration isn’t as nice as a Jupyter notebook. (Hmmm, there’s no reason Jupyter cells couldn’t contain SQL code!)