RL-Competition

July 17, 2008

The Second Annual RL-Competition is over, the results can be seen here. Congratulations to all the winners! The organizers did a great job. They worked an awful lot, and they have been extremely helpful, both on the forums and through email. All in all, I think it has been a great competition.

Still, this post will be about some not-so-well-done details. I’m in an easy position with criticism, because many of the mentioned problems will be eliminated soon or are eliminated already (and because criticism is always easy, anyway. But I’m trying to be constructive, sincerely).

Not so easy to begin

140 groups downloaded the code (and possibly took a look at it), but only 20 of them decided to submit anything. That might be a pretty normal ratio. Still, I think that many potential contestants were stopped by the high entry barrier. For example, compiling the code was far from trivial under Windows (possibly easier under *nix and *nux, though). What is more important, we had to do a lot of boring low level stuff (decoding the state, doing some standard analysis, encoding actions) before going on to get any “real work” done. A minor point, but a related one was the lack of Matlab support. Matlab is ideal for quickly hacking up ideas, testing them, then throwing the bad ones away.

Well, it seems like there will be no such barriers next year: the Matlab interface will be out this Autumn, several of the contest program sources will be available soon (or are available already, more on this later), so they can be used as a starting point for new ideas, and a Windows guide for dummies is also coming soon.

Benchmark tasks? Which?

As I understand, a priority goal of RL-Competitions is to provide standardized benchmark tasks. But now we have more than we would possibly wish for: an infinity of benchmarks. Each task has several continuous parameters that perturb the task slightly. 10 parameter sets are drawn randomly (for some benchmarks less than 10, just to be precise), these will be the competition tasks.

But are they benchmarks? Probably not. Next year’s MDPs will be different (maybe drawn from the same probability distribution, but still different samples). This year’s controllers probably won’t compete on next year’s MDPs. It will not be easy to compare the numbers. Imagine that in 2011, say, you create the Ultimate Unbeatable Mountain Car Solver, and wish to demonstrate its unsurpassable capabilities. Which MDPs should you test on? The ones of 2008? 2009? 2010? Ten new random MDPs?

The Book of Records might solve the problem. Announced officially two days ago, the Book of Records keeps results on a single fixed parametrization of all benchmark tasks, giving an absolute comparison. While it’s a really good and badly needed initiative (and I intend to write about it in detail), I think it is only a partial solution: your mighty UUMCS still needs at least two performance measures: one for the ongoing Competition, and one for the record book. And, as Segal’s law states, “A man with a watch knows what time it is. A man with two watches is never sure.

For Tetris, the situation is a bit even worse: you need three numbers for comparison. Tetris is possibly the only RL task that did have a standard formulation (of Bertsekas and Ioffe). Back in 2006 I managed to collect cca. 10 algorithms that were tested on the exact same problem. And if you build an Ultimate Unbeatable Tetris Solver, you should really try it with the old rules (or re-run the 10 old algorithms with the new rules…)

Learning?

In principle, the competition was about reinforcement learning, that is, exploring and an unknown environment. Still, as far as I could see, learning wasn’t over-emphasized in the entries: it was done offline, before the tests, or its scope was severely limited. No wonder, each step spent by exploration would lower your final score, so a fixing mediocre general controller could be better than exploring a lot in the beginning to get a better one. In some cases, this is justifiable: when balancing a helicopter, you should really get a huge penalty for exploring the “crash-landing” state. In other cases, however, we should be more forgiving with exploration. The reason is simple: the contestants should be tempted to try some cleverly learning algorithm instead of a more-or-less fixed controllers.

I’m happy to say that this will also be resolved next year: probably there will be an “exploration phase” that does not count into the final score.

Unbalanced generalization (in Tetris)

For the Tetris benchmark, task generalization has some serious side effects (I cannot speak with confidence about the other benchmarks, but some of them may also have this problem). In Tetris, one of the random-drawn parameters is the board width (it can be anything between 6 and 12). If everything else is left unchanged, a 6-wide board has 4 times higher weight (!) than a 12-wide one: you can complete lines twice as fast, and moving the pieces to their place takes roughly half the time. This means that by the time you place a tetromino on the 12-wide board (filling 1/3 rows), you can place two of them on the 6-wide one (filling 4/3 rows). Besides board width, other varying parameters (board height, tetromino probabilities, multi-row rewards) also have the potential to distort scores, although to a lesser extent.

Trying to be constructive

Standardize parameter sets

This year, random-sampled MDPs were part of the challenge: we could not know which parts of the MDPs are to be varied (still, it was not really an exploration task for the algorithms, but rather an exploration task for the programmers). Next year, this “element of surprise” will have been gone, so I don’t see many arguments in favor of random sampling.

Fixing the set of parameters would help the benchmarkization of the tasks a lot. The organizers could fix the parameters to this year’s values. Or hand-pick 10 different sets so that every types of problems are well-represented. Or even hand-pick a single parameter set so that we get a single, definitive benchmark (which could be the same as the Record Book benchmark).

If there will be multiple parameter sets, their contributions to the score should be balanced. One way to do this (not necessarly a good one, though) is to normalize each MDP’s contribution with the score of some standard controller. This year’s winners would be obvious choices for becoming “standard controllers”.

Open source

The source of the competition software is open, and we should definitely do the same with the competition controllers (yes, I’m doing my part here).

An easy way to recruit contestants is to let them experiment with working solutions, letting them change little bits. If someone thinks “I have a better idea, let’s change this part”, then we have another contestant on the hook. The pros and cons are shown beautifully by the Matlab contests, where each entry is open-source (after a short initial period of “darkness”). On the other hand, the leading programs quickly get hundreds of clones (differing only in several bytes), which leads to “benchmark overfitting”, and what is even more important, having an original idea will not necessarily get you the first place: anyone can snatch your code, tweak a number or two, and be 0.1% better.

I think that almost all of the drawbacks can be evaded if the source codes are made public only after the end of the competition. So probably it would be a nice addition to the competition rules that competitors should also submit their source codes, which are released after the deadline. For the greater benefit of the RL community :-)

And, as a token of being constructive, I try to bring my code into uploadable form (that is, adding some comments) very soon. And until that, I feel deeply ashamed for speaking (so much) before acting.

Comments, ideas, disagreements?

3 Responses to “RL-Competition”


  1. [...] – bookmarked by 4 members originally found by rillaith on 2008-10-26 RL-Competition http://gimmereward.wordpress.com/2008/07/17/rl-competition/ – bookmarked by 4 members originally [...]

  2. sandrar Says:

    Hi! I was surfing and found your blog post… nice! I love your blog. :) Cheers! Sandra. R.


  3. Sign: asygn Hello!!! utkdd and 2287rcdupbykbs and 2343 : Thanks. We look forward to hearing from you again and for your opinions on the world of work.


Leave a Reply