Machine Learning and Kaggle

Kaggle is a site that hosts different Machine Learning competitions. This blog entry is about the first Kaggle competition I competed in, what I did, and the results (#17 out of 1,528).

This blog entry also exists in Swedish, just because I could.

Kaggle describe themselves like this:

The world’s largest community of data scientists compete to solve your most valuable problems

AXA Driver Telematics Analysis

The goal of AXA Driver Telematics Analysis was to “Use telematics data to identify a driver signature”. In short, we were given a huge number of “drives” (described by GPS coordinates) that had been performed by a large number of drivers. Entries were scored on how well they identified who was driving the car.

There were 1,528 participants in the competition and I was able, after a lot of work, to place at #17. Out of the 16 entries that scored higher, 7 were teams and the rest were individuals. To my mind, being beaten by a team hurts less than being beaten by an individual. And scoring well as an individual is cooler than scoring well as a team. But maybe that’s just me.

I chose to develop all my code by myself, not using any Machine Learning packages in the process. If that seems impractical, then rest assured that it really, really is. But that’s how I learn: find a cool algorithm and implement it from some kind of paper or Wiki description. It’s often tricky, but I always learn stuff! For this competition, I was trying out some deep learning before it all ended in tears (see below), which meant developing my own deep learning (Neural Network) code first. Before I threw it all out, that is.

Anonymizing the data

The goal for AXA, who put up $30,000 in prize money (plus whatever they paid Kaggle for administering the competition), was to identify when the “wrong” driver was driving a vehicle, for instance when the vehicle had been stolen or lent out.

Kaggle/AXA had mixed the drives from a given driver with random drives from other drivers (incorrect drives). The goal was to identify the drives that didn’t belong to the driver. The incorrect drives, having been driven by someone random, were most likely performed in another city or part of the country. So many drives contained parts that looked exactly the same as parts of other drives in the set: when the driver goes to or from work or the local store, those stretches of road turn up over and over again, making it easy to flag such drives as “positives” (actual drives).

Kaggle/AXA predicted this and masked all the GPS coordinates as relative coordinates measured from the start of the drive. They also removed a short snippet from the start and the end of each drive. Then they mirrored 50% of the drives and rotated each one by a random angle between 0 and 360 degrees. All this to a) make it harder to match identical parts of trips and b) prevent Kagglers from identifying the actual addresses of any of the drivers. The latter was done to preserve the privacy of their customers.
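
I don’t know exactly how Kaggle/AXA implemented this masking, but as a minimal sketch, assuming each drive is an (n, 2) NumPy array of coordinates (the trim length and the function name are made up for illustration), it could look something like this:

    import numpy as np

    def mask_drive(points, rng):
        """Mask one drive the way the competition data was masked:
        relative coordinates, trimmed ends, random mirroring and a
        random rotation. points: (n, 2) array; rng: np.random.Generator."""
        trimmed = points[5:-5]                 # drop a short snippet at each end
        rel = trimmed - trimmed[0]             # measure from the start of the drive
        if rng.random() < 0.5:                 # mirror 50% of the drives
            rel = rel * np.array([1.0, -1.0])
        theta = rng.uniform(0.0, 2.0 * np.pi)  # random angle, 0-360 degrees
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        return rel @ rot.T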

Two identical drives (given that you could drive exactly the same route twice, including timing) would look radically different from each other. The image above shows all the drives from a specific driver. Note how they go off at different angles and never overlap. That’s of course not how it would look in reality; most drives originate from home and go to a specific spot (work) or the reverse (work->home). Without mirroring and rotation, several routes would overlap.

A model that would be useful to AXA would base its predictions on other metrics, like speed, how speed tends to change (acceleration and braking), how the driver handles curves, and so on. For that reason, AXA/Kaggle tried to force Kagglers to use that kind of data.
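
For illustration, here’s the kind of feature extraction I believe AXA was hoping for – a sketch under the assumption that the points were sampled roughly once per second (the feature set is my own choice, not anything AXA specified):

    import numpy as np

    def drive_features(points):
        """Summary features that survive mirroring and rotation. With
        points sampled about once per second, the distance between
        consecutive points is the speed in meters per second."""
        deltas = np.diff(points, axis=0)
        speed = np.hypot(deltas[:, 0], deltas[:, 1])
        accel = np.diff(speed)                      # speed change per second
        gas, brake = accel[accel > 0], accel[accel < 0]
        return {
            "mean_speed": speed.mean(),
            "p95_speed": np.percentile(speed, 95),
            "mean_accel": gas.mean() if gas.size else 0.0,
            "mean_braking": brake.mean() if brake.size else 0.0,
            "total_distance": speed.sum(),
        }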

It all ended in tears

It all went terribly wrong, though. No model could place well using the metrics useful to AXA. It turned out that the only way to compete was to reverse the masking/anonymizing performed by AXA/Kaggle, to figure out which drives were performed in areas typical of the main driver and which weren’t (called Trip Matching). If no one had used Trip Matching, then the best strategies might have been useful to AXA. But once one Kaggler used it, there was no going back. To compete, we all had to do it. And it turned out that Trip Matching was a fun problem to solve, just not very useful.

The picture below shows trips that have been matched up (by another Kaggler), and how they were obviously performed in the same area. Thus, they belonged to the true driver and could be given a high score.

My solution was very similar to what they describe: instead of comparing coordinates, each route was converted to a sequence of turns (left and right), and these turns were aligned with other drives to find matches. The Kaggler with the best trip matching algorithm would win! You could try to mix this result with some actual Machine Learning, but that typically wasn’t what decided the result.
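
As a sketch of the idea (not my actual competition code): turning a drive into a sequence of heading changes makes the rotation drop out completely, and mirroring only flips the signs, so two masked drives along the same roads end up with near-identical turn sequences. The 1-meter threshold below is an arbitrary illustrative choice:

    import numpy as np

    def turn_sequence(points, min_step=1.0):
        """Convert a drive to a sequence of turn angles. Near-stationary
        points (steps shorter than min_step) are skipped, since headings
        computed from them are mostly GPS noise."""
        deltas = np.diff(points, axis=0)
        moving = np.hypot(deltas[:, 0], deltas[:, 1]) >= min_step
        headings = np.arctan2(deltas[moving, 1], deltas[moving, 0])
        turns = np.diff(headings)
        return (turns + np.pi) % (2.0 * np.pi) - np.pi  # wrap into [-pi, pi)

    # Mirrored drives have negated turns, so a candidate drive should be
    # compared against both turn_sequence(other) and -turn_sequence(other).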

The reason this isn’t helpful to AXA is that they already knew which drives matched each other in world space; figuring that out didn’t help them. The winner was whoever was best at reversing the AXA/Kaggle obfuscation. Decidedly not helpful, but that’s what the competition became. I’m absolutely not saying this was wrong of the Kagglers – or that Kaggle or AXA did anything wrong. It’s just that we found a weakness and exploited it. This happens quite frequently and is something Kaggle/the competition owners try to prevent by masking the data. It’s not against the rules to exploit weaknesses, though.

Processor power

This competition required huge amounts of processing power. The data ran into gigabytes and every hyperthread was sorely needed. Each new iteration of my algorithm took about 24 hours to run. So I would change one thing and run my tests. 24 hours later I would evaluate the result of that change and either revert it and try another change, or keep it and tweak something else. Imagine playing tennis with one day between the hits – it’s hard to get a good creative flow.

Iterating quickly is actually extremely important, because it allows you to test many more variants – but in this case (at least for me) that was not to be. Since this competition I’ve bought a faster computer (6 cores, 12 hyperthreads and 32GB of RAM), but my new computer already has too little RAM for the competitions I’m running.

At the end of this competition, I was renting three large cloud servers (Microsoft Azure) to be able to evaluate several solutions in parallel.

Why I placed where I did

There are three main ingredients to my recipe:

  • I was lucky when picking my method. There were thousands of routes I could have gone down, many of which would have been worse. Unfortunately, there’s rarely a clear sign of which way to go forward – as there often is in software development. In these ML competitions, I find I prototype solutions and continue with whichever route seems most fruitful. That very likely leaves you following a sub-optimal route and forces you to backtrack. In this case, I remember very vividly where I went wrong and how I would have placed higher had I gone a different route at that point. Oh, well.
  • I’ve been programming for 30 years and no algorithm is too difficult to implement. When I have an idea, I can implement and evaluate it. And I can do this very quickly allowing me to rapidly iterate.
  • I worked very hard at this problem. I probably would have made more money working as a consultant for the hours I spent than I would have if I’d won the competition… But the amount I learned was well worth the investment!

Conclusion

Find a method that allows you to iterate quickly. Automate as much as possible. Precompute as much as possible. Find a method that allows you to accurately validate your performance offline, so you don’t have to waste submissions to figure out if you’re doing well.

Right now, I’m competing in a Kaggle competition related to the travel industry, where I’m trying to evolve a solution. I could use 100× the CPU/disk power I currently have and still not have enough. But I’m learning every step of the way, and I’m enjoying it immensely! Currently I’m falling quickly in the ratings, but I’m pinning my hopes on this last run! Or the one after that…
