AN ANALYTICAL METHOD TO CALCULATE THE ERGODIC AND DIFFERENCE MATRICES OF THE DISCOUNTED MARKOV DECISION PROCESSES



we obtain

\[
\nu(n,\beta) = \beta^{n} P^{n} \nu(0) + \sum_{m=0}^{n-1} \beta^{m} P^{m} q. \qquad (9)
\]

For n → ∞ and β < 1, this yields the following formula:

\[
\nu(\beta) = \sum_{n=0}^{\infty} \beta^{n} P^{n} q = (I - \beta P)^{-1} q. \qquad (10)
\]

Formula (10) allows the total expected rewards to be calculated for a given discount
factor β and starting state i = 1, . . . , N. Note that this value is finite for β < 1.
The formula is well known in the literature and is often used to calculate the rewards
of such a process over a long period of time when the real process can be modelled
by means of an MDP. If different strategies of behaviour can be chosen during the
analysis of the real process, the decision process can be optimized by means of
formula (10) [18, 25]. The choice of the optimal decision means choosing, at the i-th
state of the process, the strategy of behaviour which gives the maximum total expected
reward. As is well known, this can be achieved using Howard's iterative algorithm
for recurrent processes (or its later versions).
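
As a brief numerical illustration, the following Python sketch evaluates formula (10)
with NumPy; the three-state matrix P, the reward vector q, and the discount factor β
are hypothetical values chosen only for this example:

    import numpy as np

    # Hypothetical 3-state irreducible chain, immediate rewards q,
    # and discount factor beta < 1 (all values assumed for illustration).
    P = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.2, 0.3, 0.5]])
    q = np.array([1.0, 2.0, 0.5])
    beta = 0.9

    # Formula (10): solve (I - beta*P) v = q instead of forming the inverse.
    v = np.linalg.solve(np.eye(3) - beta * P, q)
    print(v)  # total expected discounted reward per starting state

Solving the linear system is numerically preferable to forming (I − βP)⁻¹ explicitly,
although both express the same formula.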

The formula discussed above does not, however, allow the process to be analysed over
the whole investigated period of time t ∈ [0, ∞).

This problem was solved by Howard for discrete and continuous Markov processes
without discounting, using the z-transform (for discrete processes) and the Laplace
transform (for continuous processes) of the total rewards. After the inverse
transformation, the new formulae depend explicitly on n. The dependence is a sum
of N components. The first component is always the ergodic matrix, which depends
on n. The next N − 1 components are called difference matrices, and they also
depend on n. The sum of the elements of each row of a difference matrix is always
equal to zero. For n → ∞ these components approach zero, but the total expected
reward approaches infinity in this case.
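
To make the structure described above concrete, assume for the sake of a sketch
that P is diagonalizable with distinct eigenvalues λ₁ = 1, λ₂, . . . , λ_N (this
assumption, and the symbols S and D_j, are ours, not Howard's notation):

\[
P^{n} = S + \sum_{j=2}^{N} \lambda_{j}^{\,n} D_{j}, \qquad |\lambda_{j}| < 1 \ \text{for } j \geq 2,
\]

where S plays the role of the ergodic matrix and the D_j play the role of the
difference matrices: each row of D_j sums to zero, and the terms λ_j^n D_j vanish
as n → ∞, while the accumulated undiscounted reward itself grows without bound.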

Next we present an analytical method for calculating the ergodic matrix and the
difference matrices of a discounted Markov chain with an irreducible stochastic
matrix P and a finite set of N states, based on the approach proposed by Howard [25].
The method yields a finite total expected reward ν(β) that can be characterised
by two components: the first represents the finite part of the constant reward,
which is connected with the ergodic matrix, and the second represents the finite
part of the variable reward, which is connected with the transient states of the
Markov process. The latter part is a sum of rewards connected with the difference
matrices. We can therefore assess the quality of the investigated Markov process
by comparing the constant and variable parts of the total reward over an infinite
period of time.
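
A minimal sketch of this split, again assuming a diagonalizable P with distinct
eigenvalues (the example data and variable names are ours, chosen for illustration),
can be given in Python: the rank-one spectral component belonging to the unit
eigenvalue acts as the ergodic matrix, the remaining components act as the
difference matrices, and their discounted sums reproduce formula (10):

    import numpy as np

    # Hypothetical data as before; P must be irreducible and, for this
    # sketch, diagonalizable with distinct eigenvalues.
    P = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.2, 0.3, 0.5]])
    q = np.array([1.0, 2.0, 0.5])
    beta = 0.9

    # Spectral decomposition P = V diag(lam) inv(V); each eigenvalue
    # contributes the rank-one component outer(V[:, j], W[j, :]).
    lam, V = np.linalg.eig(P)
    W = np.linalg.inv(V)
    comps = [np.outer(V[:, j], W[j, :]) for j in range(len(lam))]

    k = int(np.argmin(np.abs(lam - 1.0)))  # component of eigenvalue 1
    S = comps[k].real                      # ergodic matrix

    # Constant part of the reward, connected with the ergodic matrix ...
    v_const = (S @ q) / (1.0 - beta)
    # ... plus the variable part contributed by the difference matrices.
    v_var = sum((comps[j] @ q) / (1.0 - beta * lam[j])
                for j in range(len(lam)) if j != k).real

    # Agreement with formula (10): v_const + v_var = (I - beta*P)^{-1} q.
    v_direct = np.linalg.solve(np.eye(3) - beta * P, q)
    print(np.allclose(v_const + v_var, v_direct))  # True

Here (S q)/(1 − β) is the finite constant part of the total reward, while each
finite term D_j q/(1 − βλ_j) contributes to the variable part, so the comparison
of the two parts described above can be read off directly.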


