The price of dying
October 29, 2008
Arcade games are a perfect fit for reinforcement learning. First of all, the reinforcement signal could not be more direct than the score displayed in a corner of the screen. Cumulative rewards, pushed into your face: maximize it! If only it were that simple! To keep you informed, these games may display other helpful things, like the number of lives left, or the time remaining. These extra bits of information make things much more complicated. Surely, they are useful: if your agent dies less frequently, or reaches the same place in less time, then it is clearly better. What unclear is: how to use this information properly?
Maximizing score, while not losing too many lives, and being as fast as possible… It seems like a proper instance of multicriteria RL. Well, the problem is, no matter how many criteria you have, sooner or later you have to combine them into a single number (linear combination, lexicographic ordering, whatever). Give penalties for dying and wasting time. But how many points would you give up to get an extra life? How many points would you give for a few extra seconds? Well, many paths are open.
The path of purism
According to the purist view, reward is what you get from the environment. If the game does not lower your score after dying, then there should not be any negative rewards for dying. If the agent is smart enough, it will learn that dying gives a long-term, delayed penalty.
Theoretically, the approach is as sound as possible. Theoretically, almost any agent is smart enough (Q-learning with 1/n-greedy approximation? Sure, it learns the best solution. Eventually.) So… there might be a gap between theory and practice.
The path of explicitness
In some games (like this Tower Defense game), you may buy extra lives from your money/score. You don’t have to do anything to calculate the value, it is printed on the price tag. Other games (Mario jumps into my mind (note: this is a much better implementation, but lacks the timer to serve as an example)) are kind enough to give you the exact price of wasted time: you get bonus points for each remaining second.
The path of engineering
In case the game is not so helpful, you have to do it yourself: penalize deaths/time-wasting explicitly. Reward shaping works. The fine print is: what you get might not quite be what you wished for. If the penalty is too low, it has no deterring effect, if it is too high, the agent may become a whimpering coward, hiding in a corner instead of collecting points. (Another example I heard yesterday: an agent, living in a desolate gridworld with nothing more than walls and deadly pits, had to find an exit as fast as possible. The agent got just a little bit too much per-timestep penalty, so it decided that the exit is too far, it is better to commit suicide quickly). The solution? Engineering, parameter tuning, giving retrospective explanations why 1.34 was the only possible logical setting for the parameter.
The path of Arrrgh
The Unfair Platformer (try it! a really fun way to raise your blood pressure!) is something completely different.
It has no scores, no time limits, and you can die as many times as you like (but you will do so many, many more times). You have to learn a path to the end of the levels, with a minimal number of deaths during the learning process (it’s just like minimizing the sample complexity of exploration, isn’t it?) This video (of another unfair platform game) demonstrates beautifully, what “optimism in the face of uncertainty” means:
Dangerous concentrations of intrinsic rewards
October 23, 2008
In my previous post, I tried to make an impression of myself as being “the poor overhauled guy who does not have five free minutes to tend his blog”. Which I was, of course. Nonetheless, my excuses would have been much more honest, if I wrote that one day earlier. But I couldn’t. I had to play. A role-playing game, what else. To be precise, it’s not the kind of RPG that tells heroic stories, it is a hack’n’slash RPG (ironically, these games involve about as much role-playing as a game of Tic-Tac-Toe. I think, “role-playing” must refer to the fact that at some point of such games, you’ll have to kill dragons).
Role-playing games (still speaking of the hack’n’slash genre) are highly addictive. Somehow they stimulate the part of the brain that gives intrinsic rewards, to encourage activities that give no advantage at the moment, but they feel like progress. You feel highly inspired to improve one more level, to collect the funds for a bit better armor. Or, as in my case, to hack some monsters to get the money to buy a luck-boost to increase the chance of getting rubies, that are needed to enhance your sword, so that I can hack a boss monster to get some emeralds to enhance my other sword so that I can hack that other boss monster… (discontinued out of politeness to the reader).
The game Ginormo Sword is among the most addictive RPGs I’ve seen, and it’s hard to see why. The screenshot above tells everything about the lack of graphics. Audio is on par with the graphics. There is not even a tiny bit of storyline. And it is boring. “Boring” is not expressive enough – the game opens whole new depths of boredom. Most of the gameplay consists of picking a battlefield, where you encounter several chunks of pixels (claiming to be monsters). Then you have to hit them with your sword until their green life-bar shrinks to zero. To this end, you keep clicking the mouse for several minutes, and avoid dying. When done, you get money, and you can go to a new battlefield (or the same one again, for that matter), to click away other agressive chunks of pixels. Repeat. Many times.
Can you imagine how boring it is? I think you cannot until you try it :-) And I bet you won’t be able to stop for a long time. Amazing game it is. A game made up of pure, distilled intrinsic motivation, without being obscured by graphics, story, or entertainment.
I would be overjoyed if I could formalize, what this game does to people (some poor trapped souls on this discussion board must have spent weeks with this game). If only I could inspire a reinforcement learning agent to chase progress with such perseverence! I know of a few attempts (one, two, three). We’ve recently submitted a paper about our own take on intrinsic rewards (yes, there is RPG inside). But none of these attempts seems nearly as powerful as Ginormo Sword. It is a game made up of pure intrinsic rewards, nothing else. Beware.
Signs of life
October 22, 2008
The blog is not dead – far from it! There lie lots of half-ready posts waiting to be written on my laptop! For some reason, the first half is always easier to write. I have to admit, though, that Gimme Reward has been dormant for a while. The reasons are many. Writing papers and research reports, moving from the Netherlands to Hungary, then to the US, arranging paperwork (in amazing quantities!), being frustrated about not doing anything… But crazy times seem to be over now, so stay tuned!

