This week I read "Analogues of mental simulation and imagination in deep learning". There is so much to talk about in that paper, but let's start with a core concept of RL: the POMDP. The POMDP is exciting to me because I believe it represents human decision making well enough to show how far it can take AI toward our own human decision process. I'll keep talking about it in the next posts.
The paper's background is that we can think of two types of computational approaches to intelligence: i) statistical pattern recognition, which focuses mostly on prediction (classification, regression, task control), and ii) model-building, which focuses on creating models of the world (to explain what we see, to imagine what could have happened that didn’t or what could be true that isn’t, and then to plan actions to make it so). The paper talks about the second approach, deep learning methods for constructing models from data and learning to use them via reinforcement learning (RL), and compares such approaches to human mental simulation.
RL is about agents taking actions in an environment to maximize a cumulative reward. RL represents this process with something called a Partially Observable Markov Decision Process (POMDP). Let me explain: RL is the closest thing AI has to human behavior. We are agents in an environment that we know only in part (partially observable), and we make decisions that translate into actions in the hope of gaining something (money, time, love, peace of mind, etc.). For this process to be computable, RL represents it as a POMDP: a mathematical framework for decision processes where the outcomes are partly random and partly under the control of the decision maker, and where the decision maker cannot see the full state of the world. The POMDP is super cool and worth understanding, because it provides a useful framing for a variety of mental simulation phenomena. It illustrates how behaviors that seem quite different on the surface share a number of computational similarities.
These are graphs from the paper. The right graph shows examples of how to use a POMDP to represent human mental models. I'll first explain the graph on the left, which is a diagram of a POMDP, and then use it to walk through one example from the right graph. Let's look at the left graph. s is the environment state, a full description of the world that the agent can't directly observe. a is the action the agent takes on the state of the world; it is the only variable that can be intervened on. o is the observation of the environment that the agent directly perceives. r is a number representing the reward, which tells the agent how well it's doing in the task and which the agent can also observe. t is time. The arrows represent dependencies. Basically, the state s_t (s at time t) influences o_t, r_t, and s_{t+1}. The action a_t does not depend on s_t, but it also influences r_t and s_{t+1}. Then the cycle repeats.
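To make the arrows concrete, here is a minimal Python sketch of one such loop. Everything in it is an assumption made up for illustration and does not come from the paper: the two-state world (`TinyPOMDP`), the `noise` parameter, the reward rule, and the naive policy that simply trusts the observation. The diagram treats a_t as something we intervene on; the policy here is just one hypothetical way of picking it.

```python
import random
from dataclasses import dataclass


@dataclass
class TinyPOMDP:
    """Hypothetical two-state world; the hidden state is 'left' or 'right'."""
    state: str = "left"   # s_t, never shown to the agent
    noise: float = 0.2    # assumed chance that an observation is wrong

    def _other(self):
        return "right" if self.state == "left" else "left"

    def observe(self, rng):
        # o_t depends only on s_t: a noisy reading of the hidden state
        return self._other() if rng.random() < self.noise else self.state

    def step(self, action):
        # r_t depends on s_t and a_t: reward 1 when the action matches the state
        reward = 1.0 if action == self.state else 0.0
        # s_{t+1} depends on s_t and a_t: a correct guess flips the state
        if action == self.state:
            self.state = self._other()
        return reward


def run_episode(steps=5, seed=0, noise=0.2):
    rng = random.Random(seed)
    env = TinyPOMDP(noise=noise)
    total = 0.0
    history = []
    for t in range(steps):
        o = env.observe(rng)   # the agent sees o_t, not s_t
        a = o                  # naive policy (assumption): trust the observation
        r = env.step(a)        # the environment returns r_t and moves to s_{t+1}
        total += r             # cumulative reward the agent wants to maximize
        history.append((t, o, a, r))
    return total, history


if __name__ == "__main__":
    total, history = run_episode()
    print(total, history)
```

With `noise=0` the observation always equals the hidden state, so this policy collects the reward at every step; with noise, the agent sometimes acts on a wrong observation, which is exactly what partial observability costs it.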
Some examples of human mental simulation (physical prediction, mental rotation, theory of mind, human reinforcement learning) can be seen in the right graph. Even though they are different, all of them can be represented by a POMDP. I'll explain (c), theory of mind. Theory of mind is the ability to attribute mental states (beliefs, intents, desires, emotions, knowledge, etc.) to oneself and to others, and to understand that others have beliefs, desires, intentions, and perspectives that are different from one's own. This example represents the thought process of an agent A as it tries to figure out whether agent B wants to eat pizza or a hamburger. Let's assume that A is behind B on a street where the pizza restaurant comes before the hamburger restaurant, so A can see which restaurant B enters. A starts by thinking that it is more probable that B wants pizza (maybe A likes pizza more). A observes B looking at both restaurants (t), then sees B walking towards the pizza place (t+1). If B goes into the pizza restaurant, A can conclude that B desires pizza more than hamburgers; if B keeps walking and goes into the hamburger restaurant, A will conclude that B desires hamburgers more.
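One common way to write down A's reasoning is as a Bayesian belief update over B's hidden desire. The sketch below is my own illustration, not the paper's method: the prior, the observation names, and every probability in `LIKELIHOOD` are made-up assumptions. B's desire plays the role of the hidden state, and what A sees B do plays the role of the observation.

```python
def normalize(belief):
    """Rescale a belief so its probabilities sum to 1."""
    total = sum(belief.values())
    return {desire: p / total for desire, p in belief.items()}


def update(belief, observation, likelihood):
    """Bayes' rule: posterior is proportional to prior * P(observation | desire)."""
    unnormalized = {
        desire: belief[desire] * likelihood[observation][desire]
        for desire in belief
    }
    return normalize(unnormalized)


# Assumed prior: A starts out thinking pizza is more likely.
PRIOR = {"pizza": 0.6, "hamburger": 0.4}

# Assumed P(observation | desire). Looking around and walking toward the
# pizza place are equally likely under both desires, because the pizza
# place comes first on the street either way.
LIKELIHOOD = {
    "looks_at_both":      {"pizza": 1.0, "hamburger": 1.0},
    "walks_toward_pizza": {"pizza": 1.0, "hamburger": 1.0},
    "enters_pizza":       {"pizza": 0.9, "hamburger": 0.1},
    "walks_past_pizza":   {"pizza": 0.1, "hamburger": 0.9},
}


def infer(observations, prior=PRIOR, likelihood=LIKELIHOOD):
    """Apply one belief update per observation, in order."""
    belief = dict(prior)
    for obs in observations:
        belief = update(belief, obs, likelihood)
    return belief


if __name__ == "__main__":
    print(infer(["looks_at_both", "walks_toward_pizza", "enters_pizza"]))
    print(infer(["looks_at_both", "walks_toward_pizza", "walks_past_pizza"]))
```

Under these assumed numbers, the first two observations leave A's belief unchanged, and only the final one (entering the pizza place or walking past it) tips the belief strongly one way or the other, which matches the story above.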