The price of dying
October 29, 2008
Arcade games are a perfect fit for reinforcement learning. First of all, the reinforcement signal could not be more direct than the score displayed in a corner of the screen. Cumulative reward, pushed into your face: maximize it! If only it were that simple! To keep you informed, these games may display other helpful things, like the number of lives left, or the time remaining. These extra bits of information make things much more complicated. Surely, they are useful: if your agent dies less frequently, or reaches the same place in less time, then it is clearly better. What is unclear is how to use this information properly.
Maximizing score, while not losing too many lives, and being as fast as possible… It seems like a proper instance of multicriteria RL. Well, the problem is, no matter how many criteria you have, sooner or later you have to combine them into a single number (linear combination, lexicographic ordering, whatever). Give penalties for dying and wasting time. But how many points would you give up to get an extra life? How many points would you give for a few extra seconds? Well, many paths are open.
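To make the "single number" problem concrete, here is a minimal sketch of the two combination rules mentioned above. The `Outcome` type, the penalty values and the two sample playthroughs are all made up for illustration; nothing here comes from a real game.

```python
from typing import NamedTuple

class Outcome(NamedTuple):
    """Hypothetical summary of one playthrough."""
    score: int
    deaths: int
    seconds: float

def linear_value(o, death_penalty=100.0, time_penalty=1.0):
    # Linear combination: the penalties are the "prices" of a life and of a second.
    return o.score - death_penalty * o.deaths - time_penalty * o.seconds

def lexicographic_key(o):
    # Lexicographic ordering: score first, then fewer deaths, then less time.
    return (o.score, -o.deaths, -o.seconds)

careful = Outcome(score=900, deaths=0, seconds=120.0)
reckless = Outcome(score=1000, deaths=2, seconds=60.0)

best_linear = max([careful, reckless], key=linear_value)
best_lexicographic = max([careful, reckless], key=lexicographic_key)
```

With these (arbitrary) prices the linear rule prefers the careful run, 780 against 740, while the lexicographic rule prefers the reckless one because it looks at score first. Same data, different "best" agent.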
The path of purism
According to the purist view, reward is what you get from the environment. If the game does not lower your score after dying, then there should not be any negative rewards for dying. If the agent is smart enough, it will learn that dying gives a long-term, delayed penalty.
Theoretically, the approach is as sound as possible. Theoretically, almost any agent is smart enough (Q-learning with 1/n-greedy approximation? Sure, it learns the best solution. Eventually.) So… there might be a gap between theory and practice.
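For reference, a rough sketch of the kind of agent meant here: tabular Q-learning where the exploration rate in a state is 1/n, with n the number of visits to that state (that reading of "1/n-greedy" is my assumption). The four-state chain environment, the step size and the discount are invented for the example; there is no death and no timer in it, it only shows the learning rule.

```python
import random

N_STATES = 4          # states 0..3; state 3 is the terminal goal
LEFT, RIGHT = 0, 1

def step(state, action):
    """Deterministic toy chain: move left or right; reaching the goal pays 1."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def greedy_action(q_row, rng):
    # Break ties at random so that an all-zero table does not freeze the agent.
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])

def q_learning(episodes=200, alpha=0.5, gamma=0.9, max_steps=1000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    visits = [0] * N_STATES
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            visits[state] += 1
            epsilon = 1.0 / visits[state]      # the "1/n" in 1/n-greedy
            if rng.random() < epsilon:
                action = rng.choice([LEFT, RIGHT])
            else:
                action = greedy_action(q[state], rng)
            next_state, reward, done = step(state, action)
            # Standard Q-learning target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

q_table = q_learning()
policy = [max((LEFT, RIGHT), key=lambda a: q_table[s][a])
          for s in range(N_STATES - 1)]
```

On a toy like this the greedy policy settles on "always go right" almost immediately. The "eventually" only starts to hurt when the consequence of dying is many steps away from the action that caused it.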
The path of explicitness
In some games (like this Tower Defense game), you may buy extra lives with your money/score. You don’t have to do anything to calculate the value of a life: it is printed on the price tag. Other games (Mario jumps into my mind (note: this is a much better implementation, but lacks the timer to serve as an example)) are kind enough to give you the exact price of wasted time: you get bonus points for each remaining second.
The path of engineering
In case the game is not so helpful, you have to do it yourself: penalize deaths/time-wasting explicitly. Reward shaping works. The fine print is: what you get might not quite be what you wished for. If the penalty is too low, it has no deterring effect; if it is too high, the agent may become a whimpering coward, hiding in a corner instead of collecting points. (Another example I heard yesterday: an agent, living in a desolate gridworld with nothing more than walls and deadly pits, had to find an exit as fast as possible. The agent got just a little bit too much per-timestep penalty, so it decided that the exit was too far and that it was better to commit suicide quickly.) The solution? Engineering, parameter tuning, giving retrospective explanations why 1.34 was the only possible logical setting for the parameter.
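The gridworld anecdote can be turned into back-of-the-envelope arithmetic. The sketch below assumes undiscounted returns, a fixed per-step penalty and a one-off death penalty; the function names and all the numbers are hypothetical, since I don't know the actual setup of that experiment.

```python
def return_exit(steps_to_exit, step_penalty, exit_reward=0.0):
    # Walk all the way to the exit, paying the penalty on every step.
    return exit_reward - step_penalty * steps_to_exit

def return_pit(steps_to_pit, step_penalty, death_penalty):
    # Walk to the nearest pit and jump in.
    return -death_penalty - step_penalty * steps_to_pit

def prefers_pit(steps_to_exit, steps_to_pit, step_penalty, death_penalty,
                exit_reward=0.0):
    return (return_pit(steps_to_pit, step_penalty, death_penalty)
            > return_exit(steps_to_exit, step_penalty, exit_reward))

def break_even_step_penalty(steps_to_exit, steps_to_pit, death_penalty,
                            exit_reward=0.0):
    # Above this per-step penalty, the pit is the "rational" choice.
    # Only meaningful when the pit is closer than the exit.
    return (exit_reward + death_penalty) / (steps_to_exit - steps_to_pit)

# Exit 50 steps away, pit 2 steps away, dying costs 10.
threshold = break_even_step_penalty(50, 2, death_penalty=10.0)
sane = prefers_pit(50, 2, step_penalty=0.20, death_penalty=10.0)
suicidal = prefers_pit(50, 2, step_penalty=0.25, death_penalty=10.0)
```

Here the threshold is 10/48, a bit under 0.21 per step: at 0.20 the agent heads for the exit, at 0.25 it heads for the pit. "Just a little bit too much" indeed.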
The path of Arrrgh
The Unfair Platformer (try it! a really fun way to raise your blood pressure!) is something completely different.
It has no scores, no time limits, and you can die as many times as you like (but you will do so many, many more times). You have to learn a path to the end of the levels, with a minimal number of deaths during the learning process (it’s just like minimizing the sample complexity of exploration, isn’t it?) This video (of another unfair platform game) demonstrates beautifully what “optimism in the face of uncertainty” means.
Rewards: keep out of reach of paradoxes
June 19, 2008
The reward hypothesis states
“That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).”
For RL researchers, this is really good news: it means that whenever they come up with a reward-maximizing algorithm, it can attack (in principle) any goal-oriented task. Although the hypothesis is a bit vague, one can heartily agree with it. One must be careful, though, to keep it out of reach of paradoxes.
Although in its original form Newcomb’s paradox involves omniscient deities, philosophers arguing over free will and other dubious figures, it seems that it can be stripped of these, leaving something that is still paradoxish, but also RL-ish: a dangerous mix.