Some descriptive statistics characterising M1, M2 and M3 are summarised in Table 1.
As the table shows, there are no large differences between the training,
validation and test sets. There are, nevertheless, differences, especially in tij, which will
present some challenge to the estimation procedure used.
5.3 Model Estimation and the Overfitting Problem
Deciding on an appropriate number, H, of product units and on the value of the
Alopex parameter δ (the step size) is somewhat discretionary, involving the familiar
trade-off between speed and accuracy. The approach adopted for this evaluation was
stopped (cross-validation) training. The Alopex parameters T and N were set to 1,000
and 10, respectively.
It is worth emphasising that the training process is sensitive to its starting point. Despite
recent progress in finding the most appropriate parameter initialisation that would help
Alopex to find near-optimal solutions, the most widely adopted approach still uses
random weight initialisation in order to reduce fluctuation in evaluation. Each
experiment employed to determine H and δ was repeated 60 times, the model being
initialised with a different set of random weights before each trial. Random numbers
were generated from [-0.3, 0.3] using the rand_uni function from Press et al. (1992).
The order of the input data presentation was kept constant for each run to eliminate its
effect on the result. The training process was stopped when κ = 40,000 consecutive
iterations were unsuccessful.
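The repeated-trial protocol and the stopping rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: an iteration is taken to be "unsuccessful" when it fails to lower the validation error, the update step is a plain random ±δ walk standing in for the Alopex update, and the names `init_weights`, `perturb`, `stopped_training` and `run_trials` are hypothetical. Python's `random.Random.uniform` replaces the generator from Press et al. (1992).

```python
import random


def init_weights(n, rng, low=-0.3, high=0.3):
    """Draw n initial weights uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(n)]


def perturb(weights, rng, delta):
    """Placeholder update: move every weight by +delta or -delta at random.

    This is NOT the Alopex rule; it only gives the loop something to run.
    """
    return [w + rng.choice((-delta, delta)) for w in weights]


def stopped_training(weights, step, val_error, kappa, max_iter=100_000):
    """Iterate `step` until `kappa` consecutive iterations are unsuccessful.

    Assumption: an iteration is unsuccessful when it does not lower the
    validation error below the best value seen so far.
    Returns (best_weights, best_error, iterations_run).
    """
    best_w = list(weights)
    best_err = val_error(best_w)
    unsuccessful = 0
    iterations = 0
    while unsuccessful < kappa and iterations < max_iter:
        weights = step(weights)
        err = val_error(weights)
        iterations += 1
        if err < best_err:
            best_w, best_err = list(weights), err
            unsuccessful = 0
        else:
            unsuccessful += 1
    return best_w, best_err, iterations


def run_trials(n_trials, n_weights, delta, kappa, val_error, seed=0):
    """Repeat stopped training from a fresh random start for each trial."""
    results = []
    for trial in range(n_trials):
        rng = random.Random(seed + trial)  # different start per trial
        w0 = init_weights(n_weights, rng)
        step = lambda w, rng=rng: perturb(w, rng, delta)
        results.append(stopped_training(w0, step, val_error, kappa))
    return results
```

With the settings reported in the text, the call would use `n_trials=60` and `kappa=40_000`; `val_error` would be the ARV on the validation set M2 rather than a toy function.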
Extensive computational experiments with different combinations of H- and δ-values
were performed on a DEC Alpha 375 MHz. Table 2 summarises the results of the
most important ones. Training performance is measured in terms of ARV(M1) and
validation performance in terms of ARV(M2). The performance values represent the
mean of the 60 simulations; standard deviations are given in brackets. Since all
simulations have similar computational complexity, iterations to converge to the
minimal ARV(M2)-value may be used as a measure of learning time. It is easy to see
that the combination of H = 16 and δ = 0.0025 provides an appropriate choice for our
particular application.
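The summary step behind such a comparison can be sketched as follows. The sketch assumes that ARV (average relative variance) takes its usual form, the sum of squared prediction errors divided by the sum of squared deviations of the observations from their mean; this section does not restate the definition, so that form is an assumption here. The function names are hypothetical, and choosing the configuration with the lowest mean validation ARV is only one plausible selection criterion.

```python
from statistics import mean, stdev


def arv(observed, predicted):
    """Average relative variance (assumed form).

    Sum of squared errors divided by the sum of squared deviations of the
    observations from their mean. A value of 1 matches predicting the mean.
    """
    y_bar = mean(observed)
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    sst = sum((y - y_bar) ** 2 for y in observed)
    return sse / sst


def summarise(trials):
    """Map each (H, delta) to (mean, sample standard deviation) of its ARVs.

    `trials` maps a configuration to the list of ARV(M2) values obtained
    from its repeated simulations (at least two values per configuration).
    """
    return {cfg: (mean(values), stdev(values)) for cfg, values in trials.items()}


def best_config(trials):
    """Return the configuration with the lowest mean validation ARV."""
    summary = summarise(trials)
    return min(summary, key=lambda cfg: summary[cfg][0])
```

In practice the mean number of iterations to reach the minimal ARV(M2) value could be summarised the same way and weighed against the validation ARV when selecting H and δ.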