Novelty and Reinforcement Learning in the Value System of Developmental Robots



…evaluates each action a_p in the primed action list A and each sensation vector x_pi in the primed sensation list X. The evaluation integrates novelty and rewards.


The novelty can be measured by the agreement between what is predicted by the robot and what the robot actually senses. If the robot can predict well what will happen, the novelty is low. Then we can define novelty as the normalized distance between the selected primed sensation x_pi = (x'_1, x'_2, ..., x'_m) and the actual sensation x(t+1) at the next time step:
n(t) = (1/m) Σ_{j=1}^{m} (x'_j(t) − x_j(t+1))^2 / σ_j^2(t)    (1)
where m is the dimension of the sensory input. Each component is divided by the expected deviation σ_j^2(t), which is the time-discounted average of the squared difference (x'_j − x_j)^2. Based on IHDR, a new prototype is generated only when the sensory input is sufficiently different from the retrieved prototype.
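
To make Eq. (1) concrete, a minimal Python sketch is given below. The function names, the discount factor of 0.95, and the small epsilon are illustrative assumptions added here, not parameters specified by the system described in the text.

import numpy as np

def update_expected_deviation(sigma_sq, x_primed, x_actual, discount=0.95):
    # Time-discounted average of the squared difference (x'_j - x_j)^2,
    # used as the expected deviation sigma_j^2(t) in Eq. (1).
    return discount * sigma_sq + (1.0 - discount) * (x_primed - x_actual) ** 2

def novelty(x_primed, x_actual, sigma_sq, eps=1e-8):
    # n(t): normalized distance between the primed sensation x_p(t)
    # and the actual sensation x(t+1), averaged over the m components.
    m = x_primed.shape[0]
    return float(np.sum((x_primed - x_actual) ** 2 / (sigma_sq + eps)) / m)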

Suppose that a robot baby is staring at a toy for a while. Gradually, the primed sensation x_p comes to match the actual sensation well: "I will see that puppy sitting this way next time." Thus the current action, staring without change, loses value under the above expression, since n(t) drops. Then another action, such as turning away to look at other parts of the scene, has a relatively higher value. Thus, the robot baby turns its eyes away.

It is necessary to note here that the novelty measure n(t) is a low-level measure. The system's preference for a sensory input is typically not just a simple function of n(t). Besides novelty, the human trainer and the environment can shape the robot's behaviors through its biased sensors. A biased sensor is one whose signal carries an innate preference pattern for the robot. For example, a biased sensor takes the value r = 1 if the human teacher presses its "good" button and p = −1 if the human teacher presses its "bad" button. Now we can integrate novelty and immediate reward so that the robot takes both factors into account. The combined reward is defined as a weighted sum of the physical rewards and the novelty:

r(t) = αp(t) + βr(t) + (1 − α − β)n(t)    (2)

where 0 < α, β < 1 are adjustable parameters indicating the relative weights of p(t), r(t), and n(t), which specify punishment, positive reward, and novelty, respectively. Research in animal learning shows that different reinforcers have different effects: punishment typically produces a change in behavior much more rapidly than other forms of reinforcement (Domjan, 1998). So in our experiments, α > β > 1 − α − β.
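
A minimal sketch of the combined reward in Eq. (2) follows, assuming illustrative weights α = 0.5 and β = 0.3 that satisfy α > β > 1 − α − β; the function name and default values are ours, not taken from the experiments reported here.

def combined_reward(p, r, n, alpha=0.5, beta=0.3):
    # Eq. (2): weighted sum of punishment p(t), positive reward r(t),
    # and novelty n(t); with these defaults alpha > beta > 1 - alpha - beta.
    assert 0.0 < alpha < 1.0 and 0.0 < beta < 1.0 and alpha + beta < 1.0
    return alpha * p + beta * r + (1.0 - alpha - beta) * n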

We have, however, two major problems. First, the reward r is not always consistent. Humans may make mistakes in giving rewards, and thus the relationship between an action and the actual reward is not always certain. The second is the delayed-reward problem: the reward due to an action is typically delayed, since the effect of an action is often not known until some time after the action is complete. These two problems are dealt with by the following Q-learning algorithm.

3.3 Q-learning algorithm and Boltzmann exploration

Q-learning is one of the most popular reinforcement learning algorithms (Watkins, 1992). The basic idea is as follows. Keep a Q value for every possible pair of a primed sensation x_p and an action a_p, Q(x_p, a_p), which indicates the value of action a_p at the current state s. The action with the largest value is selected as output, and then a reward r(t+1) is received. The Q-learning updating expression is as follows:

Q(x_p(t), a_p(t)) := (1 − α) Q(x_p(t), a_p(t))
                   + α (r(t+1) + γ max_{a_p(t+1)} Q(x_p(t+1), a_p(t+1)))    (3)
where α and γ are two positive numbers between 0 and 1. The parameter α is the updating rate: the larger it is, the faster the Q value is updated by recent rewards. The parameter γ is the discount in time. With this algorithm, Q values are updated according to the immediate reward r(t+1) and the value of the next sensation-action pair, so delayed rewards can be back-propagated in time during learning. Because lower animals and infants have developed only a relatively simple value system, they should be given rewards immediately after a good behavior whenever possible. This is a technique for successful animal training.
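
For illustration only, the update in Eq. (3) could be implemented as below, with the Q table stored as a Python dictionary keyed by (state, action) pairs standing in for the primed sensation-action pairs; the dictionary representation and the default learning rate and discount are assumptions of this sketch.

def q_update(Q, state, action, reward, next_state, actions, lr=0.1, gamma=0.9):
    # Eq. (3): Q(x_p(t), a_p(t)) := (1 - lr) * Q(x_p(t), a_p(t))
    #          + lr * (r(t+1) + gamma * max_a Q(x_p(t+1), a)).
    # Q is a dict keyed by (state, action); unseen pairs default to 0.
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = (1.0 - lr) * old + lr * (reward + gamma * best_next)
    return Q[(state, action)]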

Early estimated Q values should not be over-trusted, since they are not reliable before other actions have been tried. We applied Boltzmann exploration to the Q-learning algorithm (Sutton and Barto, 1998). At each state (primitive prototype) the robot has a list of actions A(s) = (a_p1, a_p2, ..., a_pn) to choose from. The probability for action a to be chosen at s is:

P(s, a) = e^{Q(s,a)/θ} / Σ_{a'∈A(s)} e^{Q(s,a')/θ}    (4)

where θ is a positive parameter called the temperature. With a high temperature, all actions in A(s) have almost the same probability of being chosen. When θ → 0, Boltzmann exploration is more likely to choose the action a that has a high Q value. With this exploration mechanism, actions with smaller Q values can still be chosen, so the action space can be explored. Another effect of Boltzmann exploration is to avoid local minima, like always paying attention to certain
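
As a sketch of the action selection in Eq. (4), the following function draws an action from the Boltzmann distribution over the Q values of A(s). The dictionary-based Q table matches the earlier sketch, and the subtraction of the maximum logit is a numerical-stability detail not discussed in the text.

import numpy as np

def boltzmann_select(Q, state, actions, theta=1.0, rng=None):
    # Eq. (4): P(s, a) is proportional to exp(Q(s, a) / theta).
    # High theta: nearly uniform choice; theta -> 0: nearly greedy.
    rng = rng or np.random.default_rng()
    q_vals = np.array([Q.get((state, a), 0.0) for a in actions])
    logits = q_vals / theta
    logits -= logits.max()          # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]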


