Matlaby Happiness

November 13, 2008

Recently, I found myself doing weird things in Matlab (factored MDPs are to blame): using cell arrays, records, and cell arrays of records containing multidimensional arrays… I was already on the verge of using objects. While technically it is possible to do all this, they open unseen depths of code ugliness. Worse than that(??), they make the code really slow: they seem non-Matlabish not only to the human eye, but also to the Matlab interpreter (more on this in a later post).

All of that factored MDP stuff cries for C++ or Java or any normal programming language I don’t speak. I could see two routes: (1) calling Matlab from C++ or C++ from Matlab. I tried both combinations once, and I will not do that ever again. (2) writing everything in C++/Java, which makes the factored MDP parts ten times easier, and everything else (visualization, debugging, statistics) ten times more difficult. So far, I’ve gone for the third route: writing Matlab code with cell arrays of records containing multidimensional arrays, and swearing a lot.

But today I discovered that I can call Java from Matlab, and they mix very smoothly. I don’t know how it will work out in practice, but seems to be quite user-friendly. Automatic type conversion, my word.

I know, Matlab has Java support for ages. Ironically, when Matlab hangs up, it dumps long Java error messages – if nothing else, that could have given me a clue. I could have looked it up in the help, in the documentation, on google. But I have not. Until now. And now I’m happy.

Bragging time

November 7, 2008

I got some really good news yesterday: Pieter Spronck presented our paper (we = Pieter, Marc Ponsen and me) on CGAMES’08. The paper is about generating diverse behaviors in role-playing games, and it received the best paper award (joint with another paper). Woohoo.

The price of dying

October 29, 2008

Arcade games are a perfect fit for reinforcement learning. First of all, the reinforcement signal could not be more direct than the score displayed in a corner of the screen. Cumulative rewards, pushed into your face: maximize it! If only it were that simple! To keep you informed, these games may display other helpful things, like the number of lives left, or the time remaining. These extra bits of information make things much more complicated. Surely, they are useful: if your agent dies less frequently, or reaches the same place in less time, then it is clearly better. What unclear is: how to use this information properly?
Maximizing score, while not losing too many lives, and being as fast as possible… It seems like a proper instance of multicriteria RL. Well, the problem is, no matter how many criteria you have, sooner or later you have to combine them into a single number (linear combination, lexicographic ordering, whatever). Give penalties for dying and wasting time. But how many points would you give up to get an extra life? How many points would you give for a few extra seconds? Well, many paths are open.

The path of purism

According to the purist view, reward is what you get from the environment. If the game does not lower your score after dying, then there should not be any negative rewards for dying. If the agent is smart enough, it will learn that dying gives a long-term, delayed penalty.
Theoretically, the approach is as sound as possible. Theoretically, almost any agent is smart enough (Q-learning with 1/n-greedy approximation? Sure, it learns the best solution. Eventually.) So… there might be a gap between theory and practice.

The path of explicitness

In some games (like this Tower Defense game), you may buy extra lives from your money/score. You don’t have to do anything to calculate the value, it is printed on the price tag. Other games (Mario jumps into my mind (note: this is a much better implementation, but lacks the timer to serve as an example)) are kind enough to give you the exact price of wasted time: you get bonus points for each remaining second.

The path of engineering

In case the game is not so helpful, you have to do it yourself: penalize deaths/time-wasting explicitly. Reward shaping works. The fine print is: what you get might not quite be what you wished for. If the penalty is too low, it has no deterring effect, if it is too high, the agent may become a whimpering coward, hiding in a corner instead of collecting points. (Another example I heard yesterday: an agent, living in a desolate gridworld with nothing more than walls and deadly pits, had to find an exit as fast as possible. The agent got just a little bit too much per-timestep penalty, so it decided that the exit is too far, it is better to commit suicide quickly). The solution? Engineering, parameter tuning, giving retrospective explanations why 1.34 was the only possible logical setting for the parameter.

The path of Arrrgh

The Unfair Platformer (try it! a really fun way to raise your blood pressure!) is something completely different.

It has no scores, no time limits, and you can die as many times as you like (but you will do so many, many more times). You have to learn a path to the end of the levels, with a minimal number of deaths during the learning process (it’s just like minimizing the sample complexity of exploration, isn’t it?) This video (of another unfair platform game) demonstrates beautifully, what “optimism in the face of uncertainty” means:

In my previous post, I tried to make an impression of myself as being “the poor overhauled guy who does not have five free minutes to tend his blog”. Which I was, of course. Nonetheless, my excuses would have been much more honest, if I wrote that one day earlier. But I couldn’t. I had to play. A role-playing game, what else. To be precise, it’s not the kind of RPG that tells heroic stories, it is a hack’n’slash RPG (ironically, these games involve about as much role-playing as a game of Tic-Tac-Toe. I think, “role-playing” must refer to the fact that at some point of such games, you’ll have to kill dragons).
Role-playing games (still speaking of the hack’n’slash genre) are highly addictive. Somehow they stimulate the part of the brain that gives intrinsic rewards, to encourage activities that give no advantage at the moment, but they feel like progress. You feel highly inspired to improve one more level, to collect the funds for a bit better armor. Or, as in my case, to hack some monsters to get the money to buy a luck-boost to increase the chance of getting rubies, that are needed to enhance your sword, so that I can hack a boss monster to get some emeralds to enhance my other sword so that I can hack that other boss monster… (discontinued out of politeness to the reader).

The game Ginormo Sword is among the most addictive RPGs I’ve seen, and it’s hard to see why. The screenshot above tells everything about the lack of graphics. Audio is on par with the graphics. There is not even a tiny bit of storyline. And it is boring. “Boring” is not expressive enough – the game opens whole new depths of boredom. Most of the gameplay consists of picking a battlefield, where you encounter several chunks of pixels (claiming to be monsters). Then you have to hit them with your sword until their green life-bar shrinks to zero. To this end, you keep clicking the mouse for several minutes, and avoid dying. When done, you get money, and you can go to a new battlefield (or the same one again, for that matter), to click away other agressive chunks of pixels. Repeat. Many times.

Can you imagine how boring it is? I think you cannot until you try it :-) And I bet you won’t be able to stop for a long time. Amazing game it is. A game made up of pure, distilled intrinsic motivation, without being obscured by graphics, story, or entertainment.
I would be overjoyed if I could formalize, what this game does to people (some poor trapped souls on this discussion board must have spent weeks with this game). If only I could inspire a reinforcement learning agent to chase progress with such perseverence! I know of a few attempts (one, two, three). We’ve recently submitted a paper about our own take on intrinsic rewards (yes, there is RPG inside). But none of these attempts seems nearly as powerful as Ginormo Sword. It is a game made up of pure intrinsic rewards, nothing else. Beware.

Signs of life

October 22, 2008

The blog is not dead – far from it! There lie lots of half-ready posts waiting to be written on my laptop! For some reason, the first half is always easier to write. I have to admit, though, that Gimme Reward has been dormant for a while. The reasons are many. Writing papers and research reports, moving from the Netherlands to Hungary, then to the US, arranging paperwork (in amazing quantities!), being frustrated about not doing anything… But crazy times seem to be over now, so stay tuned!

Sympathy for the ghosts

August 19, 2008

I’ve spent a fair bit of time playing Pac-Man, and even more by watching it. I didn’t do it for fun, but for the sake of research. Honestly. I would never play Pac-Man for having fun, as it bores me to tears. There are sooo many better games to waste my time in a funnier way. However, for reasons unclear to me, Pac-Man was extremely popular a few decades ago, and is still pretty popular (which was partly the reason for doing research in it).

It is comforting to see that many guys have wasted way more time on Pac-Man time than me. To get a taste, check the First Church of Pac-Man. Or the wikipage of Billy Mitchell, who was the first to obtain maximum score on the original Pac-Man (in 1999, 19 years after the game release!) All you have to do for this is learn some patterns. And spend some years practicing… Ms. Pac-Man is non-deterministic, but that’s no obstacle, either.

But let’s not forget about the ghosts. They have their own personalities (the very first examples of a “game AI” ever!) Blinky gets aggressive easily, Clyde likes to wander off, and Inky moves erratically. And today I found a video that made me realize the misery of their (after?)life. Watch their heart-gripping drama:

I will never play Pac-Man any more.

CiteSeer seems to be back. That might not be very fresh news, but I discovered it only now. Well, discovery might be a strong word here: Google started to put citeseer links to the first places of my article searches.

A few years ago CiteSeer was like a miracle: a site that gives you full access all the articles you’ve stumbled upon, plus the articles that cite that article, plus the bibTeX entry to save typing. A miracle indeed. Not without errors (my .bib files were messed up for years due to copy-pasting those automatically generated bibTeX entries), but it was the best available. Up to around 2004, when it seriously started to look like abandoned. The server (and its mirrors) were often unavailable, and there was a time when the only proper way to find something was to google within the citeseer pages, citeseer’s own search being completely useless. Most importantly, the article database was not updated after 2004. Google Scholar became the way to go. I don’t particularly like Scholar, but the others are much, much worse. Does anyone use Rexa, Scirus, or Libra?

But now, CiteSeer strikes back, this time with an x-ponent to make it look more dangerous (which is said to express its neXt-generationness). It is no Scholar yet, but google searches now lead you to citeseerx pages. Seems like it just entered the beta phase, so there is hope we will see improvements soon.

A little bit of pause

July 24, 2008

… I will be doing IRL exploration in Paris for two weeks. After that, I’ll be back with some tetris-related stuff (possibly with source-code), MDP rantings, and the most reinforcement learning song ever. Or maybe something completely different :-)


July 17, 2008

The Second Annual RL-Competition is over, the results can be seen here. Congratulations to all the winners! The organizers did a great job. They worked an awful lot, and they have been extremely helpful, both on the forums and through email. All in all, I think it has been a great competition.

Still, this post will be about some not-so-well-done details. I’m in an easy position with criticism, because many of the mentioned problems will be eliminated soon or are eliminated already (and because criticism is always easy, anyway. But I’m trying to be constructive, sincerely).

Read the rest of this entry »

EWRL is over now, the next part of the is late not only because of the internet access difficulties, but because of extreme exhaustion. Talks from 9am to 6pm, then spontaneous social events with various people, closed by a great banquet (held in a place that functioned as brothel, theater and porn cinema in various stages of its existence, then transformed into a luxurious restaurant). And we’ve got the coolest conference T-shirt ever.

Read the rest of this entry »