The price of dying

October 29, 2008

Arcade games are a perfect fit for reinforcement learning. First of all, the reinforcement signal could not be more direct than the score displayed in a corner of the screen. Cumulative reward, pushed in your face: maximize it! If only it were that simple! To keep you informed, these games may display other helpful things, like the number of lives left, or the time remaining. These extra bits of information make things much more complicated. Surely, they are useful: if your agent dies less frequently, or reaches the same place in less time, then it is clearly better. What is unclear is how to use this information properly.
Maximizing score, while not losing too many lives, and being as fast as possible… It seems like a proper instance of multicriteria RL. Well, the problem is, no matter how many criteria you have, sooner or later you have to combine them into a single number (linear combination, lexicographic ordering, whatever). Give penalties for dying and wasting time. But how many points would you give up to get an extra life? How many points would you give for a few extra seconds? Well, many paths are open.
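
To make the "single number" point concrete, here is a minimal sketch of the two combination schemes mentioned above. The function names and the default penalty weights are made up for illustration; picking those weights is exactly the open question.

```python
def scalarize(score, deaths, seconds, death_penalty=100.0, time_penalty=1.0):
    """Linear combination: collapse three criteria into one number.

    The two penalty weights are arbitrary placeholders, not recommended values.
    """
    return score - death_penalty * deaths - time_penalty * seconds


def lexicographic_key(score, deaths, seconds):
    """Lexicographic ordering: fewer deaths first, then higher score, then less time.

    The priority order is an assumption; any other order is equally legitimate.
    """
    return (-deaths, score, -seconds)


# Two hypothetical runs, given as (score, deaths, seconds).
careful = (800, 0, 120)
reckless = (1000, 2, 50)

best_linear = max([careful, reckless], key=lambda run: scalarize(*run))
best_lexicographic = max([careful, reckless], key=lambda run: lexicographic_key(*run))
```

With these particular weights the linear combination prefers the reckless run, while the lexicographic ordering prefers the careful one, so the two schemes can disagree on the very same data.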

The path of purism

According to the purist view, reward is what you get from the environment. If the game does not lower your score after dying, then there should not be any negative rewards for dying. If the agent is smart enough, it will learn that dying gives a long-term, delayed penalty.
Theoretically, the approach is as sound as possible. Theoretically, almost any agent is smart enough (Q-learning with 1/n-greedy approximation? Sure, it learns the best solution. Eventually.) So… there might be a gap between theory and practice.
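
For reference, a rough sketch of what the purist agent could look like: tabular Q-learning whose only reward is the change in score, with no extra penalty for dying. It assumes that "1/n-greedy" means epsilon-greedy exploration with epsilon = 1/n in the n-th episode; the function names, the step size and the discount factor are illustrative choices.

```python
import random


def epsilon(n):
    """Exploration rate in episode n (n starts at 1), assuming a 1/n schedule."""
    return 1.0 / n


def choose_action(q_row, n, rng):
    """Epsilon-greedy choice over the action values of a single state."""
    if rng.random() < epsilon(n):
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Q-learning step. r is the raw score change and nothing else.

    Losing the last life just ends the episode (done=True), so the cost of
    dying only shows up as future score that is never collected.
    """
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


# Tiny usage example on a hypothetical table with 2 states and 2 actions.
q = [[0.0, 0.0], [0.0, 1.0]]
rng = random.Random(0)
a = choose_action(q[0], n=1, rng=rng)  # epsilon(1) == 1.0, so this is a random action
q_update(q, s=0, a=1, r=1.0, s_next=1, done=False, alpha=0.5, gamma=0.9)
```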

The path of explicitness

In some games (like this Tower Defense game), you may buy extra lives with your money/score. You don’t have to do anything to calculate the value: it is printed on the price tag. Other games are kind enough to give you the exact price of wasted time: you get bonus points for each remaining second. Mario comes to mind (note: this is a much better implementation, but lacks the timer to serve as an example).

The path of engineering

In case the game is not so helpful, you have to do it yourself: penalize deaths/time-wasting explicitly. Reward shaping works. The fine print is: what you get might not quite be what you wished for. If the penalty is too low, it has no deterring effect; if it is too high, the agent may become a whimpering coward, hiding in a corner instead of collecting points. (Another example I heard yesterday: an agent, living in a desolate gridworld with nothing more than walls and deadly pits, had to find an exit as fast as possible. The agent got just a little bit too much per-timestep penalty, so it decided that the exit was too far away and that it was better to commit suicide quickly.) The solution? Engineering, parameter tuning, giving retrospective explanations why 1.34 was the only possible logical setting for the parameter.
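
The gridworld anecdote comes down to simple arithmetic. Below is a toy version under assumed conditions: undiscounted returns, a known number of steps to the exit, a pit one step away, and no reward for reaching the exit other than the end of the per-step penalty. The function name and all the numbers are invented for illustration.

```python
def best_plan(step_penalty, death_penalty, steps_to_exit, steps_to_pit=1):
    """Compare the undiscounted return of walking to the exit with jumping into a pit."""
    exit_return = -step_penalty * steps_to_exit
    pit_return = -step_penalty * steps_to_pit - death_penalty
    return "exit" if exit_return >= pit_return else "pit"


# Exit 20 steps away, dying costs 5 points.
mild = best_plan(step_penalty=0.1, death_penalty=5.0, steps_to_exit=20)   # -2.0 vs -5.1
harsh = best_plan(step_penalty=1.0, death_penalty=5.0, steps_to_exit=20)  # -20.0 vs -6.0
```

In this toy setting the pit wins as soon as step_penalty * (steps_to_exit - steps_to_pit) exceeds death_penalty, so the "right" per-step penalty depends on how far the exit is, which the designer usually does not know in advance.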

The path of Arrrgh

The Unfair Platformer (try it! a really fun way to raise your blood pressure!) is something completely different.

It has no score and no time limit, and you can die as many times as you like (though you will die many, many more times than that). You have to learn a path to the end of each level, with a minimal number of deaths during the learning process (it’s just like minimizing the sample complexity of exploration, isn’t it?). This video (of another unfair platform game) demonstrates beautifully what “optimism in the face of uncertainty” means: