Graphical Data Representation in Bankruptcy Analysis
We start model selection with the simplest, i.e. univariate, models and
then pick the one with the highest AR. The problem that arises is how
to determine which variable provides the highest AR across possible data
samples. For a parametric model we would need to estimate the distribution
of the coefficients of the variables and, hence, their confidence intervals. This
approach, however, is of little practical use for non-parametric models.
Instead we can compare the quality of models with respect to some accuracy
measure, in our case AR. First we estimate the distributions of AR for
different models. This can be done using bootstrapping [12]. We randomly
select training and validation sets as subsamples of 500 solvent and 500
insolvent companies each. We use the 50/50 ratio since this is the worst case
with the minimum AR. The two sets do not overlap, i.e. they contain no
common observations. For each of these sets we apply the SVM with the parameters
that provide the highest AR for bivariate models (Figure 7) and estimate
the ARs. Then we perform a Monte Carlo experiment: we repeat the generation of
subsamples and the computation of ARs 100 times. Each time we record the
ARs and then estimate their distribution.
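The resampling scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: AR is computed as 2*AUC - 1 (the usual link between the accuracy ratio and the area under the ROC curve), the model is assumed to be fitted on the training set and evaluated on the validation set, and a trivial univariate scorer, `fit_sign`, stands in for the SVM. All function names and the synthetic data are hypothetical.

```python
import random
import statistics
from bisect import bisect_left, bisect_right


def accuracy_ratio(scores, labels):
    """AR = 2*AUC - 1; labels: 1 = insolvent, 0 = solvent, high score = risky."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    wins = 0.0
    for p in pos:
        lo = bisect_left(neg, p)    # solvent scores strictly below p
        hi = bisect_right(neg, p)   # ... plus those tied with p
        wins += lo + 0.5 * (hi - lo)  # ties count as half a correct ranking
    auc = wins / (len(pos) * len(neg))
    return 2.0 * auc - 1.0


def fit_sign(train_x, train_y):
    """Hypothetical stand-in for the SVM: orient one variable by class means."""
    m1 = statistics.mean(x for x, y in zip(train_x, train_y) if y == 1)
    m0 = statistics.mean(x for x, y in zip(train_x, train_y) if y == 0)
    sign = 1.0 if m1 >= m0 else -1.0
    return lambda x: sign * x


def bootstrap_ar(solvent, insolvent, fit, n_per_class=500, n_rep=100, seed=0):
    """Empirical distribution of validation ARs over repeated subsamples."""
    rng = random.Random(seed)
    ars = []
    for _ in range(n_rep):
        # Draw 2 * n_per_class companies per class without replacement and
        # split them, so training and validation sets share no observation.
        s = rng.sample(solvent, 2 * n_per_class)
        i = rng.sample(insolvent, 2 * n_per_class)
        train_x = s[:n_per_class] + i[:n_per_class]
        valid_x = s[n_per_class:] + i[n_per_class:]
        y = [0] * n_per_class + [1] * n_per_class  # 50/50 class ratio
        score = fit(train_x, y)
        ars.append(accuracy_ratio([score(x) for x in valid_x], y))
    return ars


# Synthetic data for illustration only (not the Bundesbank data).
_rng = random.Random(1)
solvent = [_rng.gauss(0.0, 1.0) for _ in range(3000)]
insolvent = [_rng.gauss(1.0, 1.0) for _ in range(3000)]
ars = bootstrap_ar(solvent, insolvent, fit_sign)
median_ar = statistics.median(ars)
```

Replacing `fit_sign` with a function that trains an SVM and returns its decision function would bring the sketch closer to the procedure in the text; the resampling loop itself would stay the same.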
At the end of this procedure we obtain an empirically estimated distribution
of AR on bootstrapped subsamples. The median AR provides a robust
measure for comparing different variables as predictors. The same approach can
be used to compare the SVM with DA and logit regression in terms of predictive
power. We compute AR for the same subsamples with the SVM, DA and
logit models. The median improvements in AR for the SVM over DA and for the
SVM over the logistic regression are also reported below.
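Because all models are evaluated on the same subsamples, the comparison can be made pairwise. The sketch below assumes that the "median improvement" is the median of the per-subsample AR differences (the text does not say whether it is this or the difference of the two medians); the function name and the numbers are purely illustrative.

```python
import statistics


def median_improvement(ar_a, ar_b):
    """Median of the paired differences ar_a[k] - ar_b[k] over subsamples."""
    if len(ar_a) != len(ar_b):
        raise ValueError("AR lists must come from the same subsamples")
    return statistics.median(a - b for a, b in zip(ar_a, ar_b))


# Made-up ARs for five subsamples, for illustration only.
ar_model_a = [0.52, 0.55, 0.50, 0.58, 0.54]
ar_model_b = [0.50, 0.51, 0.49, 0.55, 0.53]
gain = median_improvement(ar_model_a, ar_model_b)  # approximately 0.02
```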
We start this procedure with all univariate models, using the 33 variables:
K1-K9 and K11-K33, as they are denoted at the Bundesbank, and variable K10,
which is a standard normal random variable used as a reference (Table 1).
For each model the resulting distribution of ARs is represented as a box
plot (Figure 8). The red lines depict the medians. The box within each box plot
shows the interquartile range (IQR), while the whiskers extend to a distance
of 3/2 IQR in each direction from the median. Outliers beyond that range are
denoted with circles.
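The quantities behind such a box plot can be computed directly from the recorded ARs. The following is a small sketch that follows the convention stated in the text (limits at 3/2 IQR on either side of the median); the function name and the sample values are hypothetical, and the quartile interpolation method is an assumption.

```python
import statistics


def box_plot_summary(ars):
    """Median, quartiles, whisker limits and outliers for a list of ARs."""
    q1, med, q3 = statistics.quantiles(ars, n=4, method="inclusive")
    iqr = q3 - q1
    lower = med - 1.5 * iqr  # whisker limits measured from the median,
    upper = med + 1.5 * iqr  # as described in the text
    outliers = [a for a in ars if a < lower or a > upper]
    return {"median": med, "q1": q1, "q3": q3,
            "lower": lower, "upper": upper, "outliers": outliers}


# Toy values chosen so that the arithmetic is easy to follow.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

Common plotting libraries such as matplotlib measure the 1.5 IQR whisker length from the box edges by default, so their whiskers may differ slightly from the limits computed here.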
Based on Figure 8 we can conclude that variables K5 (Debt Cover) and
K29 (Interest Coverage Ratio) provide the highest median AR, around 50%.
We can also see that variables K12, K26 and K28 have very low accuracy:
their median ARs do not exceed 11.5%. The model based on the random variable
K10 has an AR of zero; in other words, it has no predictive power whatsoever.
For the next step we select variable K5, which was included in the best
univariate model.
For bivariate models we combine the best predictor from the univariate
models (K5) with the one remaining variable that delivers the highest AR (K29) (Figure
9). This procedure is repeated for each new variable added. The AR
grows until the model has eight variables, then it slowly declines. Median
ARs for the models with eight variables are shown in Figure 10.
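The stepwise procedure described above amounts to greedy forward selection driven by the median AR. A minimal sketch follows, assuming a caller-supplied function `median_ar(variables)` that returns the bootstrapped median AR of a model built on the given variables; that function, the variable names and the toy scores are all hypothetical.

```python
def forward_selection(candidates, median_ar):
    """Greedily add the variable that maximizes the median AR at each step.

    Returns a list of (selected_variables, median_ar) pairs, one per model size.
    """
    selected, history = [], []
    remaining = list(candidates)
    while remaining:
        # Try each remaining variable together with those already chosen.
        best = max(remaining, key=lambda v: median_ar(selected + [v]))
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), median_ar(selected)))
    return history


# Toy scoring rule for illustration only: each variable adds a fixed gain
# and every extra variable costs a small penalty, so AR rises, then falls.
_gains = {"A": 0.50, "B": 0.08, "C": 0.03, "noise": 0.0}


def toy_median_ar(variables):
    return sum(_gains[v] for v in variables) - 0.02 * (len(variables) - 1)


history = forward_selection(_gains, toy_median_ar)
best_vars, best_ar = max(history, key=lambda h: h[1])
```

Keeping the full `history` rather than stopping at the first decline makes it easy to plot the median AR against the number of variables and to see where it peaks.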