Novelty and Reinforcement Learning in the Value System of Developmental Robots



O decreases. For action 1 and action 2, at the be-
ginning, the primed sensation is set as a long vector
in which every element- is zero. After taking an ac-
tion, the current sensation is very different from the
initial sensation. That is, the novelty value is high.
We can see from Fig. 3 that at first- several steps,
the Q-Values of these two actions increase. However,
after we update the primed sensation, the primed
sensation will be the same as the actual sensation
if the robot- takes the action again. Then the nov-
elty becomes zero and Q-valuc decreases. After a
long time training (300 steps), the robot- can predict
the actual sensation of next- step whatever action it
takes. So the Q-valuc of each action converges to
the same value (0). This means each action has the
same probability to be chosen. The right- part- of
Fig. 3 shows the number of each action in different-
time frames (60 steps in each time frame). At- the
beginning, action 0 has a larger Q-valuc, according
to Boltzmann exploration, it- has more chance to be
chosen. The probability of action 1 and action 2 is
almost- the same. After 300 steps, the Q-valuc of each
action is nearly equal, so the numbers of each action
are close. The experiment- shows that because of ha-
bituation effect-, the robot- loses the interest- of any
action after exploration and just- chooses an action
randomly.

1

0

-1

1

0

-1

1

0


-1-----------------------------------------------

1     5     10    15    20    25

Action 0 (Stare)

5    10    15    20    25

I---- Action 1 (Turn Left) I

5    10    15    20    25

Action 2 (Turn Right)

Action number


Time


Q-value vs time


Figure 3: Habituation effect. In the left- part: the 1st,
2nd and 3rd plots correspond to the Q-value of action 0,
action 1, and action 2 respectively. In the right- part, the
frequency of actions in different- time frames.

4-2 Integration of novelty and immediate
reward

After the above experiment-, we began to issue re-
wards. For example, when the robot- turns left-, hu-
man teachers give it- a positive reward (1). For other
two actions, negative rewards (-1) will be issued.
Then the actual reward the robot- receives is an inte-
gration of novelty and immediate reward. For action
0 and action 2, the Q-values change in the same way
as in experiment- 1, converging to 0. The Q-valuc of
action 1 is always positive because we keep issuing
positive rewards. As we can see in the left- part- of
Fig. 4, at the beginning, the Q-valuc of action 0 is
the largest- and the robot- takes the action with a high
probability. After training, the Q-valuc of action 1 is
much larger than that of other actions. As shown in
Fig. 4, gradually, action 1 is chosen the most- often.

Q-value vs time

Action number

Figure 4: Integration of novelty and immediate reward.

4-3 Increase novelty with a moving object

Figure 5: Simulation of a moving object.


In order to show novelty preference, a moving toy
is added to the simulation environment- after exper-
iment- in Fig. 3. The testing image is shown in Fig.
5. Every time when the robot- is in the state with
the absolute viewing angle 0, one of these images is
generated randomly. Thus, the primed sensation of
action 0 is always different- from the actual sensation.
As shown in left- part- of Fig. 6, the Q-valuc of action
0 is positive because of high novelty. In contrast-, the
Q-Values of action 1 and action 2 are more near to
zero. After training, the robot- found that staying
with viewing angel of east- is the most- interesting.
So the action 0 is chosen the most- often.

4-4 Suppress novelty with immediate re-
wards

After the third experiment- , we issued positive re-
wards to action 2 (turn right), and negative rewards
to action 0. Thus, even though the novelty is high
when the robot- stares at a moving object-, the imme-
diate rewards suppress the novelty. Gradually, the
Q value of action 2 increases. As shown in Fig. 7,
after training, the robot- almost- chooses only action
2.



More intriguing information

1. The name is absent
2. The economic value of food labels: A lab experiment on safer infant milk formula
3. Wirkt eine Preisregulierung nur auf den Preis?: Anmerkungen zu den Wirkungen einer Preisregulierung auf das Werbevolumen
4. The Employment Impact of Differences in Dmand and Production
5. The name is absent
6. A Multimodal Framework for Computer Mediated Learning: The Reshaping of Curriculum Knowledge and Learning
7. What should educational research do, and how should it do it? A response to “Will a clinical approach make educational research more relevant to practice” by Jacquelien Bulterman-Bos
8. The name is absent
9. The name is absent
10. The name is absent
11. The name is absent
12. Foreword: Special Issue on Invasive Species
13. Has Competition in the Japanese Banking Sector Improved?
14. The name is absent
15. Transgression et Contestation Dans Ie conte diderotien. Pierre Hartmann Strasbourg
16. The InnoRegio-program: a new way to promote regional innovation networks - empirical results of the complementary research -
17. Quelles politiques de développement durable au Mali et à Madagascar ?
18. FOREIGN AGRICULTURAL SERVICE PROGRAMS AND FOREIGN RELATIONS
19. THE ECONOMICS OF COMPETITION IN HEALTH INSURANCE- THE IRISH CASE STUDY.
20. Searching Threshold Inflation for India