Darkest Hour: Part 1 – Getting data for Darkest Hour

One of my favorite games to play is called Darkest Hour. It is a mod based on a realism-focused World War II first-person shooter from 2006. The game can be so intense at times that I like to joke that I go to work to relax. The community is small but loyal, with regulars playing every night and a devoted international development team still pushing out new maps, weapons, tanks, and gameplay systems.

Oh, and they’re doing this all for free.

A cinematic trailer for a map based on Fury (2014).

Before I get too gushy, one of the tools the devs made was a statistics page that gets updated on some regular cadence. I have always wanted to learn how to scrape data from the web, put it into some statistics software like R or SPSS, and see if I could glean any insights.

The statistics page for Darkest Hour where players are ranked based on overall kills.

The animation above shows several interesting statistics: kills, deaths, K:D ratio, team kill count, and time played. If you click on any player, you get additional metrics such as when they play and their deadliest weapons.

My deadliest weapons.

For example, the weapon that I have the most kills with is the .30 cal machine gun, which can be used as an American machine gunner or on several American tanks. After that, I favor the German Kar 98, then some tanks and other machine guns.

I used to think of myself as a rifleman in the game. I’d use a speciality weapon if the situation called for it, but otherwise I preferred the accuracy of the rifles. Over time, I started feeling more confident in tanks. I got good enough that RafterMan wouldn’t yell over the comms about noobs being in tanks. Then other people started to recognize that I was an above-average tanker, and I embraced the persona with my in-game name – Sherman Jesus.

One analysis idea would be to see if I can group players based on their favorite weapons. Can you reliably identify tankers based on their data? Do people have detectable preferences among the different factions (German, American, British, and Russian)? Are there weapons that newer versus experienced players prefer? I’m getting a bit ahead of myself, though, because you can’t do any of this unless you find a smart way to get the data.

Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.

thank you wikipedia

Data scraping is basically getting access to data that you don’t have direct access to. There are technical considerations, such as how the data is delivered (JavaScript, API, HTML) or structured (XML, tidy, crazy), but there are also legal considerations. It is possible to scrape the wrong thing and get into some pretty serious trouble. I am far from an expert, so go see other resources online to learn about robots.txt and fair use. For my purposes, the lead developer gave me permission to do this. As long as I don’t break anything, we should be fine.

No take backsies.
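For anyone who does not have that kind of permission, the robots.txt check mentioned above can be sketched in a few lines. This is a minimal illustration in Python using the standard library's `urllib.robotparser`; the sample rules and the `example.invalid` URLs are made up for the example and are not the Darkest Hour site's actual rules.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

stats_ok = allowed_by_robots(SAMPLE_ROBOTS_TXT, "http://example.invalid/stats/")
private_ok = allowed_by_robots(SAMPLE_ROBOTS_TXT, "http://example.invalid/private/x")
```

With a real site you would download its `/robots.txt` first and pass the text in. A passing check only says what the site asks crawlers to do; it does not settle the legal questions.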

Problem 1: How the hell do I download the data I want?

Pulling data from JavaScript is difficult.

Most web scraping tutorials teach you how to pull data from HTML tables. These are tables that are static, and if you click on the next page, you get a new web page with new data. Go to IMDb.com and that’s how they have things structured. The data I want access to is locked behind some JavaScript table API stuff.

JavaScript is a whole other beast. Assuming you want more than the top 25 players, when you click on “next” on the stats page the URL doesn’t change to “page2/” or something. Instead, the JavaScript loads the data from another database every time. There isn’t an easy way to write a program that collects the data, clicks on the next 25 players, and repeats 500 times.

I don’t know, maybe there is, and I am too new and dumb to figure it out.

I ended up tinkering around with the stats page and various R packages that might help me out, and that is when I came across the Network tab in Chrome’s inspect element tool. When I change features of the stats table, the table requests data from another site whose address looks like an IP address.

This is what I want! Yes yes yes yes yes.

This site has exactly what I need in order to scrape the data. The data looks organized and predictable. When I want an additional 25 players, there is a link at the top that I can have a loop step through, which should get me the top 1,000 players easily!

Well… easy in theory.
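In theory, the loop could look something like the sketch below. It is written in Python rather than R, and it rests on several assumptions that are not confirmed anywhere in this post: the endpoint URL is a placeholder, the `limit` and `offset` parameter names are guesses, and the response is assumed to be JSON with the rows under a `results` key. The real values would come from the request shown in the network tab.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: replace with the request URL seen in the network tab.
BASE_URL = "http://example.invalid/players/"
PAGE_SIZE = 25  # the stats page shows 25 players at a time


def page_url(offset, limit=PAGE_SIZE, base_url=BASE_URL):
    """Build the URL for one page (parameter names are assumptions)."""
    return f"{base_url}?{urlencode({'limit': limit, 'offset': offset})}"


def fetch_json(url):
    """Download one page and parse it as JSON (assumed response format)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def collect_players(total=1000, fetch=fetch_json):
    """Step through the pages in order and gather every player record."""
    players = []
    for offset in range(0, total, PAGE_SIZE):
        page = fetch(page_url(offset))
        rows = page.get("results", [])  # assumed key holding the rows
        if not rows:
            break  # ran out of players before reaching `total`
        players.extend(rows)
    return players
```

Passing the `fetch` function in as an argument makes it possible to try the loop with a fake function before pointing it at a real server. Adding a short pause between requests would also be a polite way to honor the "don't break anything" part of the deal.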

Over the course of the next week, I’ll bang my head against the wall to try to get this all into a tidy table in R.

Buy Red Orchestra and download Darkest Hour today!