Novelty and Reinforcement Learning in the Value System of Developmental Robots



[Figure 9 plots: combined reward, novelty, reward, and Q-value of the actions Stay, Left, and Right over time (0 to 600 steps).]


Figure 12: Tree structure. Each block indicates a tree node. The first row of each node shows the x-cluster centers presented as images. The first image of the second row is the grand mean of all the x-clusters. The remaining images of the second row are the discriminating features represented as images. Here a Gaussian filter is used to alleviate noise.
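As an aside, the smoothing step mentioned in the caption can be reproduced with a standard Gaussian filter. The sketch below uses scipy.ndimage.gaussian_filter with an assumed image size and sigma, purely as an illustration and not as the exact settings used in the experiments.

import numpy as np
from scipy.ndimage import gaussian_filter

# Smooth a discriminating-feature image before displaying it, as in
# Figure 12. The 32x32 size and the sigma value are assumptions for
# this example, not values taken from the experiments.
feature_image = np.random.randn(32, 32)      # stand-in for a real feature image
smoothed = gaussian_filter(feature_image, sigma=1.0)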


Figure 9: The Q-value, reward, novelty and integrated reward of each action at position -2.


Figure 10: Preference for certain visual stimuli.


5.2 Multiple rewards for different actions

In this experiment, we gave different rewards to each action at position 2. During the first 200 steps we kept moving a toy, so the Q-value of action 0 (stay) was the highest (first plot in Fig. 13). The novelty values are shown in the third plot. At step 205 a punishment was issued to action 0, and its Q-value became negative. Positive rewards were then issued to actions 1 and 2 (second plot). Action 1 received more positive rewards, so its Q-value eventually became the largest. The fifth plot shows how the learning rate changes. The initial learning rate is 0.9; when rewards are issued, the learning rate decreases (to around 0.3), which means that the robot remembers rewards much longer than novelty.
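For illustration, the following Python sketch shows one simple way such an update could be organized. The specific update rule, the novelty weight, and the learning-rate schedule below are assumptions made for this sketch, not the exact equations of our architecture.

import numpy as np

ACTIONS = ["stay", "left", "right"]           # actions 0, 1 and 2 above

class ActionValues:
    """Per-action Q-values driven by an integrated (reward + novelty) signal."""

    def __init__(self, novelty_weight=0.5):
        self.q = np.zeros(len(ACTIONS))       # Q-value of each action
        self.lr = np.full(len(ACTIONS), 0.9)  # initial learning rate 0.9
        self.novelty_weight = novelty_weight  # assumed weighting of novelty

    def update(self, action, reward, novelty):
        # Integrated reward: external reward mixed with a novelty term.
        integrated = reward + self.novelty_weight * novelty
        # When an external reward (or punishment) is issued, lower the
        # learning rate toward 0.3 so that the reward is remembered
        # longer than the fast-changing novelty signal.
        if reward != 0.0:
            self.lr[action] = max(0.3, 0.9 * self.lr[action])
        # Exponentially weighted update of the Q-value.
        self.q[action] += self.lr[action] * (integrated - self.q[action])
        return self.q[action]

# Example: punish "stay" and reward "left", as at position 2 after step 205.
values = ActionValues()
values.update(0, reward=-1.0, novelty=0.2)    # punishment to action 0 (stay)
values.update(1, reward=+1.0, novelty=0.1)    # positive reward to action 1 (left)
print(values.q)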

[Figure 11 panels: number of prototypes in each level; retrieving time.]

Figure 11: Real time testing information.



Figure 13: The Q-value, reward, novelty, integrated re-
ward and learning rate of each action at position 2 when
multiple rewards are issued.



