uates each action ap in the primed action list A and
each sensation vector xpi in the primed sensation list
X. The evaluation integrates novelty and rewards.
The novelty can be measured by the agreement
between what is predicted by the robot and what the
robot actually senses. If the robot can predict well
what will happen, the novelty is low. Then we can
define novelty as the normalized distance between
the selected primed sensation xpi = (x'1, x'2, ..., x'm) and the actual sensation x(t+1) at the next time step:
n(t) = \frac{1}{m} \sum_{j=1}^{m} \frac{(x'_j(t) - x_j(t+1))^2}{\sigma_j^2(t)}    (1)
where m is the dimension of the sensory input. Each component is divided by the expected deviation \sigma_j^2(t), which is the time-discounted average of the squared difference (x'_j - x_j)^2. Based on IHDR, a new prototype is generated only when the sensory input is significantly different from the retrieved prototype.
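For concreteness, the following sketch shows one way the novelty measure of Eq. (1) could be computed, assuming the primed and actual sensations are NumPy vectors; the update rate rho for the time-discounted deviation and the small constant eps are illustrative choices, not values from the original system.

```python
import numpy as np

def update_deviation(sigma_sq, x_primed, x_actual, rho=0.05):
    """Time-discounted average of the per-component squared difference (illustrative rate rho)."""
    return (1.0 - rho) * sigma_sq + rho * (x_primed - x_actual) ** 2

def novelty(x_primed, x_actual, sigma_sq, eps=1e-8):
    """n(t): normalized distance between primed and actual sensation, as in Eq. (1)."""
    m = x_primed.shape[0]                     # dimension of the sensory input
    return float(np.sum((x_primed - x_actual) ** 2 / (sigma_sq + eps)) / m)
```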
Suppose that a robot baby is staring at a toy for
a while. Gradually, the primed sensation xp can
match the actually sensed sensation well: “I will
see that puppy sitting this way next time.” Thus
the current action, staring without changing, reduces
its value in the above expression, since n(t) drops.
Then, another action, such as turning away to look
at other parts in the scene, has a relatively higher
value. Thus, the robot baby turns his eyes away.
It is necessary to note here that the novelty measure n(t) is a low-level measure. The system's preference for a sensory input is typically not just a simple function of n(t). Besides novelty, the human trainer and the environment can shape the robot's behaviors through its biased sensors. A biased sensor is one for whose signal the robot has an innate preference pattern. For example, a biased sensor value is r = 1 if the human teacher presses its "good" button and p = -1 if the human teacher presses its "bad" button.
we can integrate novelty and immediate reward so
that the robot can take both factors into account.
The combined reward is defined as a weighted sum
of physical reward and the novelty:
r(t) = \alpha p(t) + \beta r(t) + (1 - \alpha - \beta) n(t)    (2)
where 0 < \alpha, \beta < 1 are adjustable parameters indicating the relative weights of p(t), r(t) and n(t), which denote punishment, positive reward and novelty, respectively. Research in animal learning shows that different reinforcers have different effects. Punishment typically produces a change in behavior much more rapidly than other forms of reinforcers (Domjan, 1998). So in our experiments, \alpha > \beta > 1 - \alpha - \beta.
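A minimal sketch of Eq. (2) follows; the particular weights alpha = 0.5 and beta = 0.3 are illustrative values chosen only to satisfy the ordering above, not values from the original experiments.

```python
def combined_reward(p, r, n, alpha=0.5, beta=0.3):
    """Eq. (2): weighted sum of punishment p, positive reward r and novelty n."""
    # alpha = 0.5 and beta = 0.3 satisfy alpha > beta > 1 - alpha - beta (= 0.2),
    # so punishment carries the largest weight, as suggested by the animal-learning results.
    assert 0.0 < alpha < 1.0 and 0.0 < beta < 1.0 and alpha + beta < 1.0
    return alpha * p + beta * r + (1.0 - alpha - beta) * n
```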
We have, however, two major problems. First, the reward r is not always consistent. Humans may make mistakes in giving rewards, and thus the relationship between an action and the actual reward is not always certain. The second is the delayed-reward problem. The reward due to an action is typically delayed, since the effect of an action is typically not known until some time after the action is complete. These two problems are dealt with by the following Q-learning algorithm.
3.3 Q-learning algorithm and Boltzmann exploration
Q-learning is one of the most popular reinforcement learning algorithms (Watkins, 1992). The basic idea is as follows. Keep a Q value for every possible pair of a primed sensation xp and a primed action ap: Q(xp, ap), which indicates the value of action ap at the current state s. The action with the largest value is selected as output and then a reward r(t+1) is received. The Q-learning updating expression is as follows:
Q(x_p(t), a_p(t)) := (1 - \alpha) Q(x_p(t), a_p(t)) + \alpha \left( r(t+1) + \gamma \max_{a'} Q(x_p(t+1), a_p(t+1)) \right)    (3)
where \alpha and \gamma are two positive numbers between 0 and 1. The parameter \alpha is the updating rate: the larger it is, the faster the Q value is updated by the recent rewards. The parameter \gamma is the discount factor in time. With this algorithm, Q values are updated according to the immediate reward r(t+1) and the value of the next sensation-action pair; thus a delayed reward can be back-propagated in time during learning. Because lower animals and infants have only developed a relatively simple value system, they should be given rewards immediately after a good behavior whenever possible. This is a technique for successful animal training.
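The update of Eq. (3) could be realized along the following lines; the dictionary-based Q table keyed by (prototype, action) pairs and the default parameter values are assumptions made for illustration only.

```python
def q_update(Q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Eq. (3): blend the old Q value with the reward plus the discounted best next value."""
    best_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = (1.0 - alpha) * old + alpha * (reward + gamma * best_next)
```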
Early estimated Q values should not be over-trusted, since they are not reliable before other actions have been tried. We applied Boltzmann exploration to the Q-learning algorithm (Sutton and Barto, 1998). At each state (primitive prototype) the robot has a list of actions A(s) = (ap1, ap2, ..., apn) to choose from. The probability for action a to be chosen at state s is:
P(s, a) = \frac{e^{Q(s,a)/\theta}}{\sum_{a' \in A(s)} e^{Q(s,a')/\theta}}    (4)
where \theta is a positive parameter called temperature. With a high temperature, all actions in A(s) have almost the same probability of being chosen. When \theta \to 0, Boltzmann exploration is more likely to choose the action a that has a high Q value. With this exploration mechanism, actions with smaller Q values can still be chosen, so that the action space can be explored.
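A sketch of Boltzmann action selection according to Eq. (4) is given below; shifting the Q values by their maximum is only a numerical-stability detail of this sketch and is not part of the original formulation, and the default temperature is an arbitrary illustrative value.

```python
import numpy as np

def boltzmann_select(Q, state, actions, theta=1.0, rng=None):
    """Eq. (4): sample an action with probability proportional to exp(Q(s,a)/theta)."""
    rng = rng or np.random.default_rng()
    q = np.array([Q.get((state, a), 0.0) for a in actions])
    logits = (q - q.max()) / theta        # subtracting the max improves numerical stability only
    probs = np.exp(logits)
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]
```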
Another effect of Boltzmann exploration is to avoid
local minima, like always paying attention to certain