Machine Learning and Kaggle (Swe)

Kaggle is a site that hosts various machine learning competitions. This blog post describes the first Kaggle competition I took part in, what I did, and how it turned out.

Kaggle describes itself as

The world’s largest community of data scientists compete to solve your most valuable problems

AXA Driver Telematics Analysis

AXA Driver Telematics Analysis was about “using telematics data to identify a driver signature”. In short, we were given access to an enormous number of car trips (described by GPS coordinates) made by different drivers. Entries were judged on how well they identified which driver had made which trip.

The competition had 1,528 participants and, after a lot of work, I managed to finish in 17th place. Of the 16 entries ahead of me, 7 were submitted by teams and the rest by individuals. I chose to develop all my code myself, from scratch, which may seem impractical, but that is how I learn: find a cool algorithm and then implement it myself.

Anonymizing the data

The goal for AXA, who paid out $30,000 in prize money (plus money to Kaggle for administering the competition), was to identify when the “wrong” driver is driving a vehicle, for example when it has been stolen or lent out.

Kaggle/AXA had mixed the drives from a given driver with randomly chosen drives from some other driver. The goal was to identify the drives that did not belong to the main driver. In practice this meant that those drives had most likely been made in a different city than the drives made by the actual driver. So to find the mismatched drives, it was enough to find drives that came from a different area and flag those as the false ones.

Kaggle/AXA had anticipated this, so they converted all GPS coordinates to relative coordinates, measured in metres, with the origin placed where the drive began. On top of that, they cut away a piece at the start and a piece at the end of each drive. Then half of the drives were mirrored, and the drives were rotated by between 0 and 360 degrees. This was to make it harder to a) identify the area, which would have made the competition too easy, and b) identify the real address the person came from or went to. The latter was done to protect the people included in the data.

So two identical drives (however one would manage that) would look radically different in the data. The image above shows the drives made by one particular driver. Note that they head off at random in every direction. That is of course not what it looks like in reality: most people drive from home to certain fixed points and back again, so many drives should overlap.

A model that AXA could benefit from cannot be based on where the driver drives, only on *how* the driver drives: how fast the person takes curves, how the person speeds up and slows down, and so on. That is why AXA/Kaggle tried to make sure that only this information could be extracted from the data.

It all ended in tears

It all went horribly wrong, though, and none of the models that placed highly could be used for anything useful. People quickly realised that the only way to get competitive results was to try anyway to find which drives matched other drives in the database. This was called Trip Matching, and the image below shows a number of trips that have been matched against each other by another Kaggler. My solution closely resembles the one they describe. Because the route has been rotated, you look at the turns made rather than at the coordinates, and then try to find other routes that have similar turns over some stretch. Whoever had the best Trip Matching method won. You could try to blend the result with a little real Machine Learning, but without good Trip Matching you were sunk.

Processor power

Every test required enormous processing power. Each code change had to run through all the trips to produce its result, and towards the end, when the code was at its most advanced, this took at least a day. So it came down to: make a small change, wait a day, make another small change, wait a day. Since then I have bought a faster computer (6 cores, 12 hyperthreads, 32 GB of RAM), but even the new computer has too little RAM for the competitions I am working on now.

Towards the end of this Kaggle I rented three big boxes in the cloud (Microsoft Azure) to be able to test several variants in parallel…

Why I placed so high

Truth be told, there are three main ingredients in my recipe:

  1. I was lucky when I chose my method; there were thousands of others, many of which would have been worse. Unfortunately, there is usually no clear direction to go in when you get a bad result. So you have to try a few different directions and continue in the one that seems most promising, but you are fumbling in the dark because every run takes so long.
  2. I have been programming for 30 years, so no algorithm is too complicated to implement. If I had an idea, I could implement it and try it out.
  3. I ground away like *hell*
    I usually try to build a method that lets me iterate blazingly fast, so that I can actively test different variants; it becomes more creative, and between runs you remember what you were trying to do. In this competition that was hard.

Right now I am competing in a Kaggle about the travel industry, where I am trying to evolve a solution.

Machine Learning and Kaggle

Kaggle is a site that hosts different Machine Learning competitions. This blog entry is about the first Kaggle I competed in, what I did and the results thereof (#17 out of 1528).

This blog entry also exists in Swedish, just because I could.

Kaggle describes itself thus:

The world’s largest community of data scientists compete to solve your most valuable problems

AXA Driver Telematics Analysis

The goal of AXA Driver Telematics Analysis was to “Use telematics data to identify a driver signature”. In short, we were given a huge number of “drives” (described by GPS coordinates) that had been performed by a large number of drivers. Entries were scored on how well they were able to identify who was driving the car.

There were 1,528 participants in the competition and I was able, after a lot of work, to place at #17. Out of the 16 entries that scored higher, 7 were by teams and the rest were by individuals. To my mind, being beaten by a team hurts less than being beaten by an individual. And scoring well as an individual is cooler than scoring well as a team. But maybe that’s just me.

I chose to develop all my code by myself, not using any Machine Learning packages in the process. If that seems impractical, then rest assured that it really, really is. But that’s how I learn. Find a cool algorithm and implement it from some kind of paper or Wiki description. It’s often tricky, but I always learn stuff! For this competition, I was trying out some deep learning before it all ended in tears (see below), and for that to work I had to develop my own deep learning (neural network) code. Before I threw it all out, that is.

Anonymizing the data

The goal for AXA, who put up $30,000 in winnings (plus whatever money they paid Kaggle for administering the competition), was to try to identify when the “wrong” driver was driving a vehicle, for instance when the vehicle had been stolen or lent out.

Kaggle/AXA had mixed the drives from a given driver with random drives from other drivers (incorrect drives). The goal was to identify drives that didn’t belong to the driver. The incorrect drives, having been driven by someone random, were most likely performed in another city or part of the country than the actual drives. In many of the drives that came from the correct driver, on the other hand, you could find parts that looked exactly the same: the stretches where the driver goes to or from work or the local store. Those parts of routes would turn up over and over again, making it easy to determine that they were “positives” (actual drives).

Kaggle/AXA predicted this and masked all the GPS coordinates as relative coordinates measured from the start of the drive. They also removed a short snippet from the start and the end of each drive. Then they mirrored 50% of the drives and rotated them by a random angle between 0 and 360 degrees. All this was to make it harder for Kagglers to a) match identical parts of trips and b) identify the actual address of any of the drivers. The latter was done to preserve the privacy of their customers.
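As a rough illustration of the masking described above, here is a minimal Python sketch. It is not AXA's or Kaggle's actual code: the function names, the `trim` length and the choice of mirroring across the y-axis are all my assumptions. It only shows the four steps (trim, re-origin, mirror, rotate) applied to a drive stored as a list of (x, y) points in metres.

```python
import math
import random


def anonymize(points, trim=2, rng=None):
    """Return a masked copy of a drive given as a list of (x, y) points in metres.

    The trim length and the mirror axis are assumptions made for this sketch.
    """
    rng = rng or random.Random()
    # Drop a short snippet from the start and the end of the drive.
    pts = points[trim:len(points) - trim]
    # Re-express the coordinates relative to the new starting point.
    x0, y0 = pts[0]
    pts = [(x - x0, y - y0) for x, y in pts]
    # Mirror roughly half of the drives.
    if rng.random() < 0.5:
        pts = [(-x, y) for x, y in pts]
    # Rotate by a random angle between 0 and 360 degrees.
    angle = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in pts]


def segment_lengths(points):
    """Distance between consecutive points; unchanged by mirroring and rotation."""
    return [math.hypot(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]
```

Note that mirroring and rotation change where every point ends up but leave the distances between consecutive points untouched, which is part of why the masking could later be worked around.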

Two identical drives (given that you could drive exactly the same way twice, including timing) would look radically different from each other. The image above shows all the drives from a specific driver. Note how they go off at different angles and never overlap. That’s of course not how it would look in reality: most drives originate from home and go to a specific spot (work) or the reverse (work->home). Without mirroring and rotation, several routes would overlap.

A model that would be useful to AXA would base its predictions on other metrics, like the speed, how speed tends to change (acceleration and deceleration), how the driver handles curves and so on. For that reason, AXA/Kaggle tried to force Kagglers to use that kind of data.
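To make those metrics concrete, here is a small Python sketch of how speed and acceleration could be derived from the relative coordinates. It assumes the points are sampled at a fixed interval `dt` (one second here, which is my assumption, not something stated above), and the feature names are made up for the example.

```python
import math
import statistics


def speeds(points, dt=1.0):
    """Speed between consecutive points, in metres per second."""
    return [math.hypot(x2 - x1, y2 - y1) / dt
            for (x1, y1), (x2, y2) in zip(points, points[1:])]


def accelerations(speed, dt=1.0):
    """Change in speed between consecutive samples (negative means braking)."""
    return [(b - a) / dt for a, b in zip(speed, speed[1:])]


def features(points, dt=1.0):
    """A few summary numbers describing how a drive was driven (needs 3+ points)."""
    v = speeds(points, dt)
    a = accelerations(v, dt)
    return {
        "mean_speed": statistics.mean(v),
        "max_speed": max(v),
        "max_accel": max(a),
        "max_brake": -min(a),
    }
```

None of these numbers depend on where the drive took place or on how it was rotated or mirrored, which is what makes them the kind of signal AXA was after.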

It all ended in tears

It all went terribly wrong, though. No models could place well using the metrics useful to AXA. It turned out that the only way to compete was to try to reverse the masking/anonymizing performed by AXA/Kaggle to figure out which drives were performed in areas typical of the main driver and which weren’t (called Trip Matching). If no one had used Trip Matching then the best strategies would possibly have been useful to AXA. But once one Kaggler used it, there was no going back. To compete, we all had to do it. And it turned out that Trip Matching was a fun problem to solve, just not a very useful one.

The picture below shows trips that have been matched up (by another Kaggler) and how they obviously were performed in the same area. Thus, they belonged to the true driver and could be given a high score.

My solution was very similar to what they describe; instead of comparing coordinates, the route was converted to turns (left and right) and these turns were aligned to other drives trying to find matches. The Kaggler with the best trip matching algorithm would win! You could try to mix this result with some actual Machine Learning, but that typically wasn’t what decided the result.
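The idea can be sketched in a few lines of Python. This is not my competition code, just a minimal illustration under simplifying assumptions: points are evenly spaced along the route, there is no GPS noise, and the function names and the `tol` threshold are invented for the example. Heading changes are unaffected by rotation, and mirroring only flips their sign, so each drive is compared against both the other drive's turns and their negation.

```python
import math


def turns(points):
    """Heading change (radians) at each interior point, wrapped to [-pi, pi)."""
    headings = [math.atan2(y2 - y1, x2 - x1)
                for (x1, y1), (x2, y2) in zip(points, points[1:])]
    return [(b - a + math.pi) % (2 * math.pi) - math.pi
            for a, b in zip(headings, headings[1:])]


def longest_match(a, b, tol=0.05):
    """Longest aligned run where two turn sequences agree within tol."""
    best = 0
    # Slide b along a; an offset handles drives trimmed at different places.
    for offset in range(-(len(b) - 1), len(a)):
        run = 0
        for i in range(len(a)):
            j = i - offset
            if 0 <= j < len(b) and abs(a[i] - b[j]) <= tol:
                run += 1
                best = max(best, run)
            else:
                run = 0
    return best


def match_score(drive_a, drive_b, tol=0.05):
    """Best match length, trying drive_b both as-is and mirrored."""
    ta, tb = turns(drive_a), turns(drive_b)
    mirrored = [-t for t in tb]
    return max(longest_match(ta, tb, tol), longest_match(ta, mirrored, tol))
```

On real data you would have to resample the route by distance and ignore long straight stretches, since those match each other trivially. A brute-force comparison like this one is also part of the reason the whole thing needed so much processing power.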

The reason this isn’t helpful to AXA is that they already knew which drives matched each other in world space, so figuring that out didn’t help them. The winning solution was whoever was better at reversing the AXA/Kaggle obfuscation. Decidedly not helpful, but that’s what the competition became. I’m absolutely not saying this was wrong of the Kagglers, or that Kaggle or AXA did anything wrong. It’s just that we found a weakness and exploited it. This happens quite frequently and is something Kaggle and the competition owners try to prevent by masking the data. It’s not against the rules to exploit weaknesses though.

Processor power

This competition required huge amounts of processing power. The data ran into gigabytes and every hyperthread was sorely needed. Each new iteration of my algorithm took about 24 hours to run. So I would change one thing and run my tests. 24 hours later I would evaluate the result of that change and either revert it and make another change, or move forward with the new change and tweak something else. Imagine playing tennis with one day between the hits; it’s hard to get a good creative flow.

Iterating quickly is actually extremely important, because it allows you to test many more variants, but in this case (at least for me) that was not to be. Since this competition I’ve bought a faster computer (6 cores, 12 hyperthreads and 32 GB of RAM), but already my new computer has too little RAM for the competitions I’m running.

At the end of this competition, I was renting three large cloud servers (Microsoft Azure) to be able to evaluate several solutions in parallel.

Why I placed where I did

There are three main ingredients to my recipe:

  • I was lucky when picking my method. There were thousands of routes I could have gone down, many of which would have been worse. Unfortunately, there’s rarely a clear sign of which way to go forward, as there often is in software development. In these ML competitions, I find I prototype solutions and continue with whichever route seems most fruitful. That very likely leaves you following a sub-optimal route and forces you to backtrack. In this case, I remember very vividly where I went wrong and how I would have placed higher had I gone a different route at that point. Oh, well.
  • I’ve been programming for 30 years and no algorithm is too difficult to implement. When I have an idea, I can implement and evaluate it. And I can do this very quickly allowing me to rapidly iterate.
  • I worked very hard at this problem. I probably would have made more money working as a consultant for the hours I spent than I would have if I had won the competition… But the amount I learned was well worth the investment!

Conclusion

Find a method that allows you to iterate quickly. Automate as much as possible. Precompute as much as possible. Find a method that allows you to accurately validate your performance offline, so you don’t have to waste submissions to figure out if you’re doing well.
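As an example of what offline validation could look like for a competition like this one, here is a hedged Python sketch. It builds a local test set the same way the competition data was described above (a driver's own drives mixed with random drives from others) and scores a ranking with AUC. Using AUC as the metric is my assumption for the sketch, and all names are invented for the example.

```python
import random


def build_validation_set(drives_by_driver, driver, n_false, rng):
    """Mix one driver's drives (label 1) with random drives from others (label 0)."""
    own = [(d, 1) for d in drives_by_driver[driver]]
    others = [d for key, drives in drives_by_driver.items()
              if key != driver for d in drives]
    false = [(d, 0) for d in rng.sample(others, n_false)]
    mixed = own + false
    rng.shuffle(mixed)
    return mixed


def auc(labels, scores):
    """Chance that a random positive scores above a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With something like this in place, a change can be scored locally, on a subset of drivers if need be, before spending a submission on it.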

Right now, I’m competing in a Kaggle related to the travel industry, where I’m trying to evolve a solution. I could use 100 times the CPU and disk power I currently have and still not have enough. But I’m learning every step of the way, and I’m enjoying it immensely! Currently I’m falling quickly in the rankings, but I’m pinning my hopes on this last run! Or the one after that…

LearntoCode.biz – Learn to code by solving puzzles

We’ve finally finished the website for our game Machinist-Fabrique: http://learntocode.biz
 
Machinist-Fabrique is a game where you learn to code by solving puzzles. The tools at your disposal are different code concepts. It’s fun and easy and suited for kids 10+ (and adults). We believe that learning to code is and will be an essential skill for our children and for ourselves.

Speed up MVC Visual Studio Development

Whenever I work with Visual Studio doing MVC development, I find that starting the website after a recompile takes far too long. So I asked my pal Tobias why it’s so fast when he’s doing it. The answer:
“I don’t do a full rebuild and I don’t attach the debugger.”
  1. Press Ctrl-F5 to start the website without the debugger
  2. Use the browser
  3. If you make changes,
    1. hit Shift-F6 to compile and update only the project of the file you’re currently in.
    2. if the solution has focus, the entire solution is rebuilt
    3. hit F5 in the browser to reload the page with the changes.
  4. If you need to debug, attach the debugger by pressing Ctrl+Alt+P

Thanks, Tobias. I’m writing this blogpost so I won’t have to ask you again!

Machinist–Fabrique on IndieDB.com

My upcoming game, Machinist-Fabrique, is getting its own IndieDB entry. Check it out at the link below (once it’s gone live, that is).

Machinist - Fabrique

Machinist

R.U.B.E physics test of Machinist level

 

I’m working on a game called Machinist and I found this awesome tool called R.U.B.E. that lets me graphically design stuff for the Box2D Physics Engine. If you wish to play with Box2D, I warmly recommend starting with R.U.B.E.

Designing this level using R.U.B.E. took a few hours (I’d never used the tool before). Doing it in code would have been impossible, because those shapes are way too finicky.

Worng

You know something’s amiss when you write this in your map editor:

image

And it looks like this in your renderer:

image

That’s just plain Worng!