Evolving a better (?) solution

This post relates to the Expedia Hotel Recommendations competition that I’m currently participating in on Kaggle. It’s extremely specific to this competition; if any of the techniques I use prove interesting, I’ll come up with something more general some time in the future. See discussions on this post.

If you’re looking for an executive summary of some machine learning strategies, then this post won’t punch your ticket.

I thought I’d pay back those who’ve been generous with information and scripts by outlining what I’m currently trying. It probably won’t be useful to you, but it may get some ideas going! I’m using C# and all my stuff is scratch-built and not open source. Sorry.

The goal of this competition is to predict which hotel cluster (out of 100) a user booked, given historical data and information about the search the user performed. The current go-to solution for this competition seems to be a counting solution: you basically count the users that booked during a similar search (say, same destination country and same user location city). To suggest the best hotel cluster, you pick the most common hotel cluster for a given combination of features.

So: you combine a number of features (out of more than 22) and pick whichever hotel cluster is most common for that combination of features.

But this doesn’t give you all the hotel clusters you need. A combination that does really well may only give you a small percentage of the predictions you need. So you add the hotel clusters from another combination of features until you have hotel clusters for each user.
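To make the counting idea concrete, here’s a minimal sketch of how that might look in C#. The BookingRow type, its field names and the single hard-coded feature combination are assumptions for illustration; my actual code is more general than this:

using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; the real data has many more columns.
public class BookingRow
{
    public int HotelCountry;
    public int UserLocationCity;
    public int HotelCluster;
}

public static class ClusterCounter
{
    // Count bookings per (feature combination key, hotel cluster).
    public static Dictionary<string, Dictionary<int, int>> Count(IEnumerable<BookingRow> rows)
    {
        var counts = new Dictionary<string, Dictionary<int, int>>();
        foreach (BookingRow row in rows)
        {
            // One example feature combination: hotel country + user location city.
            string key = row.HotelCountry + "|" + row.UserLocationCity;
            Dictionary<int, int> clusterCounts;
            if (!counts.TryGetValue(key, out clusterCounts))
            {
                counts[key] = clusterCounts = new Dictionary<int, int>();
            }
            int seen;
            clusterCounts.TryGetValue(row.HotelCluster, out seen);
            clusterCounts[row.HotelCluster] = seen + 1;
        }
        return counts;
    }

    // The most common hotel clusters for a given key, best first.
    public static List<int> Suggest(Dictionary<string, Dictionary<int, int>> counts, string key, int max = 5)
    {
        Dictionary<int, int> clusterCounts;
        if (!counts.TryGetValue(key, out clusterCounts))
        {
            return new List<int>();
        }
        return clusterCounts.OrderByDescending(kv => kv.Value).Take(max).Select(kv => kv.Key).ToList();
    }
}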

Time is precious and it’s slipping away

There are several difficulties with this:

  • Loading the data to iterate takes a long time
  • It takes quite some time to evaluate a single given combination of features
  • Combining several combinations of features takes even longer, and it uses massive amounts of RAM. Some of my combined solutions used 20+ combinations of several features.
  • Figuring out which features to combine and keeping track of how well they did is really painful.

Loading the data

I don’t know about you, but the first thing I do in a Kaggle competition is try to make iterations quicker. For this one, I converted the csv files to binary files, which made a world of difference. I have one file with my training data, one with my validation data and one with my testing data. Loading these files takes less than one minute, which cuts down on iteration time immensely.
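The conversion itself is nothing fancy. Here’s a minimal sketch of the idea; it assumes every column has already been mapped to an integer id, which glosses over the dates, floats and missing values in the real data:

using System.IO;
using System.Linq;

// Sketch: one-off conversion of a csv file into a flat binary file of ints.
public static class CsvToBinary
{
    public static void Convert(string csvPath, string binPath)
    {
        using (var writer = new BinaryWriter(File.Create(binPath)))
        {
            foreach (string line in File.ReadLines(csvPath).Skip(1)) // skip the header row
            {
                int[] values = line.Split(',').Select(int.Parse).ToArray();
                writer.Write(values.Length);
                foreach (int value in values)
                {
                    writer.Write(value);
                }
            }
        }
    }
}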

Computing the hotel clusters of one feature combination

I code in C#, and the Dictionary implementation is plenty fast; there isn’t much I can do to improve that. The same goes for the RAM requirements while building the dictionary. But there’s a trade-off between storage and CPU, and that’s pre-computing! So what I do is pre-compute the hotel clusters for any given combination of features and simply store them to disk. If I ever change which hotel clusters a given combination would generate, I have to delete the files, but that’s rare. Pre-computing about 300 combinations of features takes a couple of hours with my C# code. But once they’re there, they’re quite fast to load from disk!
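As a rough sketch (not my actual file format), persisting the pre-computed suggestions for one feature combination could look like this; hotel clusters fit in a byte (0–99), which helps keep the files small:

using System.Collections.Generic;
using System.IO;

// Sketch: save and load the pre-computed suggestions for one feature combination.
public static class PrecomputedClusters
{
    public static void Save(string path, Dictionary<string, List<byte>> suggestions)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            writer.Write(suggestions.Count);
            foreach (var pair in suggestions)
            {
                writer.Write(pair.Key);               // feature combination key
                writer.Write((byte)pair.Value.Count); // number of suggested clusters
                foreach (byte cluster in pair.Value)
                {
                    writer.Write(cluster);
                }
            }
        }
    }

    public static Dictionary<string, List<byte>> Load(string path)
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            int count = reader.ReadInt32();
            var suggestions = new Dictionary<string, List<byte>>(count);
            for (int i = 0; i < count; i++)
            {
                string key = reader.ReadString();
                int clusterCount = reader.ReadByte();
                var clusters = new List<byte>(clusterCount);
                for (int c = 0; c < clusterCount; c++)
                {
                    clusters.Add(reader.ReadByte());
                }
                suggestions[key] = clusters;
            }
            return suggestions;
        }
    }
}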

Here’s a picture of the files of hotel clusters:

[Image: directory listing of pre-generated hotel cluster files]

As you can see, they’re quite compact; the largest ones (9 MB) are *tiny* compared to the CSV submissions you send to Kaggle (50 MB).

And here’s a snippet of the output from the program that generates the pre-computed files:

orig_destination_distance, user_location_city    0.0558    0.8704    32139    0.1032    25.3    0.8704
srch_destination_id, user_id, user_location_city    0.0008    0.3629    1126    0.0509    11.9    0.3629

Now, if it seems like there would be a huge number of feature combinations, you’re right. It’s astronomical. So you have to limit the maximum number of features you’re going to combine (in my case I’m working with 4 at the moment) and which features to include (I’m working with 10 at the moment). This leaves me with 385 unique combinations. That’s doable. Add one more feature? Well, that number grows a lot larger; we’re talking about factorials here.
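For the curious, the 385 is just the number of ways to pick between 1 and 4 of the 10 features: C(10,1) + C(10,2) + C(10,3) + C(10,4) = 10 + 45 + 120 + 210 = 385. A quick sanity check:

using System;
using System.Linq;

// Sanity check: number of ways to pick between 1 and maxPick of featureCount features.
public static class CombinationCount
{
    public static long Choose(int n, int k)
    {
        long result = 1;
        for (int i = 1; i <= k; i++)
        {
            result = result * (n - i + 1) / i; // exact at every step
        }
        return result;
    }

    public static void Main()
    {
        int featureCount = 10;
        int maxPick = 4;
        long total = Enumerable.Range(1, maxPick).Sum(k => Choose(featureCount, k));
        Console.WriteLine(total); // 10 + 45 + 120 + 210 = 385
    }
}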

Combining several combinations of features

To combine several feature combinations, you would traditionally keep the dictionaries/counts in RAM, which quickly becomes prohibitive. But there’s no reason to do that: once you’ve determined which hotel clusters a particular combination of features would suggest given its counts, the counts are no longer useful and can be discarded. So to combine two feature combinations (say [hotel_country] and [srch_destination_id, user_id, user_location_city]) you simply concatenate the hotel clusters until you have one per user. This becomes very fast, which is required for the next step…

Evolving a list of feature combinations

How do we figure out which feature combinations to use? And in which order? And each feature combination can return a total of 5 hotel cluster suggestions; should we use all of them?

Did it seem like there would be a lot of feature combinations to manage (it turned out to be 385 once limits were added)? Well, that holds doubly^10 true here. If you allow 15 different feature combinations, picked from 385 different variants, the number of possibilities comes out to gazillions. Puts the fac in factorial. Exhaustive searching goes right out the window. Anyone has the right to their own opinion, but not to their own fact(orial)s.

Clever clever

A clever few of you are thinking “yeah, but once you’ve used a combination, you wouldn’t use it again”. Well, wrong, so wrong. In an effort to give myself even more options to work with, I’m only adding the *first* hotel cluster from any given feature combination. Using the same feature combination once more will add the second, and so on. Simply put: I add the first suggested hotel cluster that doesn’t already exist for that user.
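Put together, the combining step described in the last two sections might look roughly like this sketch (simplified types, and assuming the usual five predictions per user; my real code works on the pre-computed files described earlier):

using System.Collections.Generic;

// Sketch of combining an ordered list of feature combinations into final predictions.
// suggestionsPerCombination[i] maps a user id to that combination's ranked hotel clusters.
// Re-using the same combination later in the list naturally picks up its 2nd, 3rd, ... suggestion,
// because we only ever add the first suggestion that the user doesn't already have.
public static class Combiner
{
    public static Dictionary<int, List<byte>> Combine(
        List<Dictionary<int, List<byte>>> suggestionsPerCombination,
        IEnumerable<int> userIds,
        int maxClustersPerUser = 5)
    {
        var result = new Dictionary<int, List<byte>>();
        foreach (int userId in userIds)
        {
            result[userId] = new List<byte>();
        }

        foreach (var suggestions in suggestionsPerCombination)
        {
            foreach (var pair in result)
            {
                List<byte> clusters = pair.Value;
                if (clusters.Count >= maxClustersPerUser)
                {
                    continue;
                }
                List<byte> suggested;
                if (!suggestions.TryGetValue(pair.Key, out suggested))
                {
                    continue;
                }
                foreach (byte cluster in suggested)
                {
                    if (!clusters.Contains(cluster))
                    {
                        clusters.Add(cluster); // first suggestion the user doesn't already have
                        break;
                    }
                }
            }
        }
        return result;
    }
}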

He used to do surgery for girls in the eighties, but gravity always wins out

So how do I figure out which frigging combinations to use? It’s hardly possible to even enumerate them! Well, evolution to the rescue! What I did was create a list evolver that uses trivial Genetic Algorithm techniques. It literally took me less than two hours to implement. You’ll find the actual code below.

It first creates 200 randomly generated feature-combination “lists”. The best ones are propagated into the next generation, where they get a chance to create some mutated versions of themselves. This goes on for quite a few generations, hopefully while better and better solutions come up. My best solution at the moment is at 0.50561, and it was created using the techniques described.

Combining techniques

My list evolver from the previous section uses whatever solution gives the best results for the validation period. This means that other solutions can be worked into the process, even if they were generated by some means other than the counting technique, as long as they produce appropriate results for the validation period. So in the end I need results for the validation period and the testing period, but that’s easy enough to generate. At that point, I can combine XGBoost results (terrible!) with Dictionary results (pretty good!) and clustering techniques (remains to be seen!).

I don’t get it, how does evolution choose?

I’ve been using evolution (specifically Genetic Algorithms (GA), Genetic Programming (GP) and evolved neural networks) since about 1996 for different projects. GA isn’t an optimal solution to most (if any) problems. One problem is that it isn’t greedy, which means you can’t pick a nice trajectory and follow it. It’s inherently stochastic, which makes it slower than greedy or numerical solutions would be. But you can trivially use it anywhere, even when you don’t know of any greedy or numerical solution.

If you can measure which of two solutions is better (let them duel it out!), or how good a solution is in relation to all possible solutions (a score), you can make GA work. You just make mutated copies of the currently best solutions and keep the best ones as you go along. In this specific case, I measure how well the solution would do in the competition. Keep the best ones, discard the worse ones.

A trivial example

Here’s a trivial example demonstrating how my list evolver works. A list of no more than 6 items must be filled with the numbers -1 to 10. Below you see the lists it comes up with over a couple of generations.

1: score=35/70*: 10, 9, 7, 1, 8
2: score=45/90*: 10, 10, 9, 7, 1, 8
3: score=51/102*: 10, 10, 9, 7, 7, 8
4: score=53/106*: 10, 10, 10, 9, 7, 7
5: score=56/112*: 10, 10, 10, 10, 9, 7
6: score=59/118*: 10, 10, 10, 10, 10, 9
7: score=60/120*: 10, 10, 10, 10, 10, 10

The solution is so obvious given the premises, and it takes milliseconds to find. But finding a good combination of feature combinations to use in this Kaggle competition? That takes about a day. And there’s no guarantee that it’s an optimal solution.
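For reference, here’s a sketch of how the trivial example could be wired up against the ListEvolver class listed below; the scoring function simply sums the list, and the fast and full scores are the same for this toy problem:

using System;
using System.Linq;
using Boltzman.Core.Utilities;

// Sketch of driving the list evolver for the trivial example above:
// fill a list of at most 6 items, drawn from -1..10, so that the sum is maximised.
public static class TrivialExample
{
    public static void Main()
    {
        int[] possibleItems = Enumerable.Range(-1, 12).ToArray(); // -1, 0, 1, ..., 10

        var evolver = new ListEvolver<int>(
            genotype => genotype.List.Sum(), // fast score: just the sum
            null,                            // no separate full score for this toy problem
            possibleItems)
        {
            MaxLength = 6,
            AllowMultipleEntriesOfSameItem = true
        };

        for (int epoch = 0; epoch < 20; epoch++)
        {
            evolver.RunEpoch();
            Console.WriteLine(
                epoch + ": score=" + evolver.BestGenotype.FullScore +
                ": " + string.Join(", ", evolver.BestGenotype.List));
        }
    }
}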

Feel free to ask questions; I’ll do my best to answer. During the competition, questions are best addressed where I announce this post in the forums.

Code for the list evolver class

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

namespace Boltzman.Core.Utilities
{
    // Evolves a list of items (here: feature combinations) using simple GA-style mutations.
    // OrderHashedList, ThreadSafeRandom and the RandomItem/RandomItems/Top/LockedAdd
    // extension methods are small helpers from the same library and aren't shown here.
    public class ListEvolver<TListItem>
    {
        public ListEvolver(
            Func<Genotype, float> getFastScore,
            Func<Genotype, float> getFullScore,
            params TListItem[] possibleItems)
        {
            GetFastScore = getFastScore;
            GetFullScore = getFullScore;
            PossibleItems = possibleItems.ToList();
            PopulationSize = 100;
            BestGenotypeFastScoreCutoff = 0.8f;
            ParsimonyCost = 0;
            InitialSize = 5;
            TriedItems = new HashSet<OrderHashedList<TListItem>>();
        }

        public float BestGenotypeFastScoreCutoff { get; set; }
        public int GenesInFullMutate { get; set; }
        public int PopulationSize { get; set; }
        public int InitialSize { get; set; }
        public List<Genotype> Population { get; set; }
        public float ParsimonyCost { get; set; }
        public Func<Genotype, float> GetFastScore { get; set; }
        public Func<Genotype, float> GetFullScore { get; set; }
        public List<TListItem> PossibleItems { get; set; }
        public HashSet<OrderHashedList<TListItem>> TriedItems { get; set; }
        public Genotype BestGenotype { get; set; }
        public Action<bool> AfterEpochAction { get; set; }
        public int Epoch { get; private set; }
        public bool AllowMultipleEntriesOfSameItem { get; set; }
        public int MaxLength { get; set; }

        // Scores a candidate list. Returns null if the list has already been tried
        // or if it didn't score well enough to warrant a full evaluation.
        public Genotype Evaluate(List<TListItem> items)
        {
            if (MaxLength > 0 && items.Count > MaxLength)
            {
                items = items.Take(MaxLength).ToList();
            }
            OrderHashedList<TListItem> orderHashedList = new OrderHashedList<TListItem>(items);

            lock (TriedItems)
            {
                if (TriedItems.Contains(orderHashedList))
                {
                    return null;
                }
            }

            Genotype genotype = new Genotype(items);
            genotype.FastScore = GetFastScore(genotype);
            // Only candidates that come close to the current best get the expensive full evaluation.
            if (BestGenotype == null || genotype.FastScore > BestGenotype.FastScore * BestGenotypeFastScoreCutoff)
            {
                if (GetFullScore == null)
                {
                    genotype.FullScore = genotype.FastScore;
                }
                else
                {
                    genotype.FullScore = GetFullScore(genotype);
                }
            }

            lock (TriedItems)
            {
                if (TriedItems.Contains(orderHashedList))
                {
                    return null;
                }
                TriedItems.Add(orderHashedList);

                if (BestGenotype == null)
                {
                    BestGenotype = genotype;
                }
                else if (genotype.FullScore.HasValue)
                {
                    float actualScore = ComputeScore(genotype);
                    float bestActualScore = ComputeScore(BestGenotype);

                    if (actualScore > bestActualScore)
                    {
                        BestGenotype = genotype;
                    }
                    return genotype;
                }
            }
            return null;
        }

        // Runs one generation: keeps the best genotype, copies over a tournament-selected tenth
        // of the old population and fills the rest with mutated offspring.
        public void RunEpoch()
        {
            Epoch++;

            if (Population == null)
            {
                BuildInitialPopulation();
                AfterEpochAction?.Invoke(true);
                return;
            }

            List<Genotype> oldPopulation = Population.ToList();
            Population.Clear();
            Genotype previousBestGenotype = BestGenotype;

            List<Func<List<TListItem>, List<TListItem>>> actions =
                new List<Func<List<TListItem>, List<TListItem>>>
                {
                    x => MutateFull(InitialSize),
                    MutateDelete,
                    MutateReplace,
                    MutateRepositionSmall,
                    MutateRepositionLarge,
                    MutateInsert
                };

            if (AllowMultipleEntriesOfSameItem)
            {
                actions.Add(MutateSplit);
            }

            if (BestGenotype != null)
            {
                Population.Add(BestGenotype);
            }

            while (Population.Count < oldPopulation.Count * 0.1f)
            {
                Population.Add(oldPopulation.RandomItems(10).Top(x => ComputeScore(x)));
            }

            Parallel.For(Population.Count - 1, PopulationSize, itemId =>
            {
                Genotype parent = oldPopulation.RandomItems(10).Top(x => ComputeScore(x));
                Genotype genotype = Evaluate(actions.RandomItem()(parent.List));
                if (genotype != null)
                {
                    Population.LockedAdd(genotype);
                }
            });

            AfterEpochAction?.Invoke(previousBestGenotype != BestGenotype);
        }

        private void BuildInitialPopulation()
        {
            Population = new List<Genotype>(PopulationSize);
            Parallel.For(
                0,
                PopulationSize, i =>
                {
                    lock (Population)
                    {
                        if (Population.Count == PopulationSize)
                        {
                            return;
                        }
                    }

                    List<TListItem> list = MutateFull(InitialSize);
                    Genotype genotype = Evaluate(list);
                    if (genotype != null)
                    {
                        Population.LockedAdd(genotype);
                    }
                });
        }

        // Replaces one random entry with a random item.
        private List<TListItem> MutateReplace(List<TListItem> list)
        {
            list = list.ToList();
            for (int i = 0; i < 5; i++)
            {
                TListItem item = PossibleItems.RandomItem();
                if (!list.Contains(item) || AllowMultipleEntriesOfSameItem)
                {
                    list[ThreadSafeRandom.GetNextInt(list.Count)] = item;
                    return list;
                }
            }

            return list;
        }

        // Duplicates one random entry in place.
        private List<TListItem> MutateSplit(List<TListItem> list)
        {
            list = list.ToList();
            int pos = ThreadSafeRandom.GetNextInt(list.Count);
            list.Insert(pos, list[pos]);
            return list;
        }

        // Removes one random entry.
        private List<TListItem> MutateDelete(List<TListItem> list)
        {
            if (list.Count > 0)
            {
                list = list.ToList();
                list.RemoveAt(ThreadSafeRandom.GetNextInt(list.Count));
            }
            return list;
        }

        // Moves one random entry to an arbitrary position.
        private List<TListItem> MutateRepositionLarge(List<TListItem> list)
        {
            list = list.ToList();
            int pos = ThreadSafeRandom.GetNextInt(list.Count);
            TListItem moved = list[pos];
            list.RemoveAt(pos);
            list.Insert(ThreadSafeRandom.GetNextInt(list.Count), moved);
            return list;
        }

        // Swaps one random entry with one of its neighbours.
        private List<TListItem> MutateRepositionSmall(List<TListItem> list)
        {
            list = list.ToList();
            if (list.Count < 2)
            {
                return list;
            }
            int pos = ThreadSafeRandom.GetNextInt(list.Count);
            TListItem item = list[pos];
            int otherPos = pos + (ThreadSafeRandom.GetNextBool() ? -1 : 1);
            if (otherPos <= 0)
            {
                otherPos = pos + 1;
            }
            if (otherPos >= list.Count)
            {
                otherPos = pos - 1;
            }
            list[pos] = list[otherPos];
            list[otherPos] = item;
            return list;
        }

        // Inserts a random item at a random position.
        private List<TListItem> MutateInsert(List<TListItem> list)
        {
            list = list.ToList();
            for (int i = 0; i < 5; i++)
            {
                TListItem item = PossibleItems.RandomItem();
                if (!list.Contains(item) || AllowMultipleEntriesOfSameItem)
                {
                    list.Insert(ThreadSafeRandom.GetNextInt(list.Count + 1), item);
                    return list;
                }
            }

            return list;
        }

        // Builds a fresh random list of up to count distinct items.
        public List<TListItem> MutateFull(int count)
        {
            List<TListItem> items = new List<TListItem>();
            for (int i = 0; i < count; i++)
            {
                TListItem item = PossibleItems.RandomItem();
                if (!items.Contains(item))
                {
                    items.Add(item);
                }
            }
            return items;
        }

        // Full score minus a penalty for longer lists (ParsimonyCost is 0 by default).
        private float ComputeScore(Genotype genotype)
        {
            return genotype.FullScore.Value - genotype.List.Count * ParsimonyCost;
        }

        public class Genotype
        {
            public Genotype(List<TListItem> list)
            {
                List = list;
            }

            public List<TListItem> List { get; set; }
            public float FastScore { get; set; }
            public float? FullScore { get; set; }
        }
    }
}

