Novelty and Reinforcement Learning in the Value System of Developmental Robots



Combined RewardNovelty    Reward Q-Value

Illl

->. O -ʌ M θɔ —4- O -ʌ M θɔ —4- O -ʌ M OO     O -ʌ M

Z ⅛ ____________________ Z ⅛ ____________________ Z ⅛ ____________________ Z ⅛ ____________________

^ ______

--- Stay

-- - Left

Right

)_______100     200     300     400

500     600

_ . . . . .. . -------- _ _

Stay

Left ■
Right

)       100     200     300     400

500     600

-   ∙∙ ∙       ∙    ∙       ⅜ ∙     ∙ ζ             .* , .∙ ∙            V Д

Stay

Left ■
Right

)_______100     200     300     400

500     600

. ∙      : .       .: ∙.∙λ .....∙.-                     t∙

Stay

Left ■
Right

1       100     200     300     400

Time

500     600


Figure 12: Tree Structure. Each block indicates a tree
node. The first row of each node shows the x-cluster
centers presented as images. The first image of the sec-
ond row is the grand mean of all the х-clusters. The
remaining images of the second row are the discriminat-
ing features represented as images. Here a Gaussian filter
is used to alleviate noise.


Figure 9: Tlie Q-value, reward, novelty and integrated
reward of each action at position -2.


Figure 10: Preference to certain visual stimuli.


5.2 Multiple rewards for different actions

In this experiment, we gave different rewards to each
action at position 2. In the beginning (first 200
steps), we kept moving a toy, so the Q-valne of ac-
tion 0 (stay) is the highest one (the first plot in Fig.
13). The value of novelty is shown in the third plot.
Then punishment was issued to action 0 at step 205.
Its Q-value became negative. Positive rewards were
issued to action 1 and 2 (the second plot). Action 1
got more positive rewards, finally its Q-value became
the largest. The fifth plot shows the changes of learn-
ing rate. The initial learning rate is 0.9. If rewards
are issued, the learning rate decreases (around 0.3),
which means that the robot would remember rewards
much longer than novelty.

No. of prototypes in each level Retrieving time

Figure 11: Real time testing information.


0

100     200     300     400     500     600

Time

Figure 13: The Q-value, reward, novelty, integrated re-
ward and learning rate of each action at position 2 when
multiple rewards are issued.




More intriguing information

1. Foreign direct investment in the Indian telecommunications sector
2. Should Local Public Employment Services be Merged with the Local Social Benefit Administrations?
3. TRADE NEGOTIATIONS AND THE FUTURE OF AMERICAN AGRICULTURE
4. The name is absent
5. Spatial Aggregation and Weather Risk Management
6. The name is absent
7. Chebyshev polynomial approximation to approximate partial differential equations
8. Growth and Technological Leadership in US Industries: A Spatial Econometric Analysis at the State Level, 1963-1997
9. PERFORMANCE PREMISES FOR HUMAN RESOURCES FROM PUBLIC HEALTH ORGANIZATIONS IN ROMANIA
10. Conditions for learning: partnerships for engaging secondary pupils with contemporary art.