AN ANALYTICAL METHOD TO CALCULATE THE ERGODIC AND DIFFERENCE MATRICES OF THE DISCOUNTED MARKOV DECISION PROCESSES



we obtain

\[
\nu(n,\beta) = \beta^{n} P^{n} \nu(0) + \sum_{m=0}^{n-1} \beta^{m} P^{m} q. \qquad (9)
\]

For n → ∞ and β < 1, this yields the following formula:

\[
\nu(\beta) = \sum_{n=0}^{\infty} \beta^{n} P^{n} q = (I - \beta P)^{-1} q. \qquad (10)
\]

Formula (10) allows the total expected rewards to be calculated for a given discount
factor β and starting state i = 1, . . . , N. Note that this value is finite for β < 1.
The formula is well known in the literature and is often used to calculate the rewards
of such a process over a long period of time when the real process can be modelled
by means of an MDP. If different strategies of behaviour can be chosen during the
analysis of the real process, the decision process can be optimized by means of
formula (10) [18, 25]. The choice of the optimal decision means choosing, at the i-th
state of the process, the strategy of behaviour which gives the maximum total expected
reward. As is well known, this can be achieved using Howard's iterative algorithm
for recurrent processes (or its later versions).
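
As a brief numerical illustration, the following Python sketch evaluates formula (10)
with NumPy; the three-state matrix P, the reward vector q, and the discount factor β
are hypothetical values chosen only for this example:

    import numpy as np

    # Hypothetical 3-state irreducible chain, immediate rewards q,
    # and discount factor beta < 1 (all values assumed for illustration).
    P = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.2, 0.3, 0.5]])
    q = np.array([1.0, 2.0, 0.5])
    beta = 0.9

    # Formula (10): solve (I - beta*P) v = q instead of forming the inverse.
    v = np.linalg.solve(np.eye(3) - beta * P, q)
    print(v)  # total expected discounted reward per starting state

Solving the linear system is numerically preferable to forming (I − βP)⁻¹ explicitly,
although both express the same formula.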

The formula discussed above does not, however, allow the process to be analysed over
the whole investigated period of time t ∈ [0, ∞).

This problem was solved by Howard for discrete and continuous Markov processes
without discounting, using the z-transform (for discrete processes) and the Laplace
transform (for continuous processes) of the total rewards. After the inverse
transformation, the new formulae depend explicitly on n. The dependence is a sum
of N components. The first component is always the ergodic matrix, which depends
on n. The next N − 1 components are called difference matrices, and they also
depend on n. The sum of the elements of each row of a difference matrix is always
equal to zero. For n → ∞ these components approach zero, but the total expected
reward approaches infinity in this case.
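
To make the structure described above concrete, assume for the sake of a sketch
that P is diagonalizable with distinct eigenvalues λ₁ = 1, λ₂, . . . , λ_N (this
assumption, and the symbols S and D_j, are ours, not Howard's notation):

\[
P^{n} = S + \sum_{j=2}^{N} \lambda_{j}^{\,n} D_{j}, \qquad |\lambda_{j}| < 1 \ \text{for } j \geq 2,
\]

where S plays the role of the ergodic matrix and the D_j play the role of the
difference matrices: each row of D_j sums to zero, and the terms λ_j^n D_j vanish
as n → ∞, while the accumulated undiscounted reward itself grows without bound.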

Next we present an analytical method for calculating the ergodic matrix and the
difference matrices of a discounted Markov chain with an irreducible stochastic
matrix P and a finite set of N states, based on the approach proposed by Howard [25].
The method yields a finite total expected reward ν(β) that can be characterised
by two components: the first represents the finite part of the constant reward,
which is connected with the ergodic matrix, and the second represents the finite
part of the variable reward, which is connected with the transient states of the
Markov process. The latter part is a sum of rewards connected with the difference
matrices. We can therefore assess the quality of the investigated Markov process
by comparing the constant and variable parts of the total reward over an infinite
period of time.
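
A minimal sketch of this split, again assuming a diagonalizable P with distinct
eigenvalues (the example data and variable names are ours, chosen for illustration),
can be given in Python: the rank-one spectral component belonging to the unit
eigenvalue acts as the ergodic matrix, the remaining components act as the
difference matrices, and their discounted sums reproduce formula (10):

    import numpy as np

    # Hypothetical data as before; P must be irreducible and, for this
    # sketch, diagonalizable with distinct eigenvalues.
    P = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.2, 0.3, 0.5]])
    q = np.array([1.0, 2.0, 0.5])
    beta = 0.9

    # Spectral decomposition P = V diag(lam) inv(V); each eigenvalue
    # contributes the rank-one component outer(V[:, j], W[j, :]).
    lam, V = np.linalg.eig(P)
    W = np.linalg.inv(V)
    comps = [np.outer(V[:, j], W[j, :]) for j in range(len(lam))]

    k = int(np.argmin(np.abs(lam - 1.0)))  # component of eigenvalue 1
    S = comps[k].real                      # ergodic matrix

    # Constant part of the reward, connected with the ergodic matrix ...
    v_const = (S @ q) / (1.0 - beta)
    # ... plus the variable part contributed by the difference matrices.
    v_var = sum((comps[j] @ q) / (1.0 - beta * lam[j])
                for j in range(len(lam)) if j != k).real

    # Agreement with formula (10): v_const + v_var = (I - beta*P)^{-1} q.
    v_direct = np.linalg.solve(np.eye(3) - beta * P, q)
    print(np.allclose(v_const + v_var, v_direct))  # True

Here (S q)/(1 − β) is the finite constant part of the total reward, while each
finite term D_j q/(1 − βλ_j) contributes to the variable part, so the comparison
of the two parts described above can be read off directly.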


