the penalty $\xi_i$ is introduced, which is related to the distance from the hyperplane bounding observations of the same class to observation $i$. $\xi_i > 0$ if a misclassification occurs. All observations satisfy the following two constraints:
$$ y_i(x_i^\top w + b) \ge 1 - \xi_i, \qquad (1) $$
$$ \xi_i \ge 0. \qquad (2) $$
With the normalisation of $w$, $b$ and $\xi_i$ as in (1), the margin equals $2/\|w\|$.
The convex objective function to be minimised given the constraints (1) and
(2) is:
$$ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i. \qquad (3) $$
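To illustrate, the short sketch below (our own example, not taken from the study; the toy data, the candidate hyperplane and the use of NumPy are assumptions) computes the slack variables $\xi_i$ implied by constraints (1)–(2) and evaluates the objective (3):

```python
import numpy as np

# Toy data: rows of X are observations x_i, y holds class labels +1 / -1.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.array([0.8, 0.6])   # candidate normal vector of the separating hyperplane
b = 0.0                    # candidate intercept
C = 10.0                   # capacity parameter

# Slack from constraint (1): xi_i = max(0, 1 - y_i (x_i^T w + b)),
# which automatically enforces constraint (2), xi_i >= 0.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))

# Objective (3): 0.5 * ||w||^2 + C * sum_i xi_i
objective = 0.5 * np.dot(w, w) + C * xi.sum()
print(xi, objective)
```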
The parameter $C$, called capacity, is related to the width of the margin zone: the smaller $C$ is, the wider the margins that are possible. Using the well-established theory of convex optimisation [6] we can derive the dual Lagrangian:
$$ L_D = \frac{1}{2}\, w(\alpha)^\top w(\alpha) - \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \delta_i \xi_i + \sum_{i=1}^{n} \gamma_i (\alpha_i - C) - \beta \sum_{i=1}^{n} \alpha_i y_i \qquad (4) $$
for the dual problem:
$$ \min_{\alpha_i, \delta_i, \gamma_i, \beta} \; \max_{w_k, b, \xi_i} L_D, \qquad (5) $$
Here for a linear SVM:
$$ w(\alpha)^\top w(\alpha) = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j. \qquad (6) $$
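To make the dual problem concrete, the sketch below is a minimal illustration of our own, assuming the standard reduction of (4)–(5) to the box-constrained problem $\max_\alpha \sum_i \alpha_i - \tfrac{1}{2} w(\alpha)^\top w(\alpha)$ with $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$ (the multipliers $\delta_i$, $\gamma_i$, $\beta$ eliminated); the use of SciPy's SLSQP solver and all names are assumptions, not the original study's code:

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(K, y, C):
    """Maximise sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij
    subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K          # Q_ij = y_i y_j K(x_i, x_j)
    obj = lambda a: 0.5 * a @ Q @ a - a.sum()  # minimise the negative dual objective
    grad = lambda a: Q @ a - np.ones(n)
    constraints = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]
    result = minimize(obj, np.zeros(n), jac=grad, bounds=[(0.0, C)] * n,
                      constraints=constraints, method="SLSQP")
    return result.x

# Toy usage with the linear Gram matrix of (6): K_ij = x_i^T x_j
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = solve_svm_dual(X @ X.T, y, C=10.0)
print(alpha)
```

The same routine accepts any Gram matrix, so a non-linear kernel can be substituted directly for the linear one.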
To obtain non-linear classifying functions in the data space, a more general form applies:
$$ w(\alpha)^\top w(\alpha) = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j). \qquad (7) $$
The function $K(x_i, x_j)$ is called a kernel function. Since it has a closed-form representation, the kernel is a convenient way of mapping low-dimensional data into a high-dimensional (often infinite-dimensional) feature space. It must satisfy the Mercer conditions [15], i.e. be symmetric and positive semidefinite or, in other words, represent a scalar product in some Hilbert space [19].
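In practice the Mercer conditions can be checked numerically on a Gram matrix; the sketch below is our own illustration (the tolerance and the use of NumPy's eigenvalue routine are our choices) testing symmetry and positive semidefiniteness:

```python
import numpy as np

def satisfies_mercer(K, tol=1e-10):
    """Check that a Gram matrix is symmetric and positive semidefinite,
    i.e. a valid kernel matrix in the sense of the Mercer conditions."""
    symmetric = np.allclose(K, K.T, atol=tol)
    # eigvalsh assumes a symmetric matrix and returns real eigenvalues
    psd = symmetric and np.all(np.linalg.eigvalsh(K) >= -tol)
    return symmetric and psd

# A linear-kernel Gram matrix K = X X^T is always symmetric PSD
X = np.random.randn(5, 3)
print(satisfies_mercer(X @ X.T))   # True
```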
In our study we applied an SVM with an anisotropic Gaussian kernel
$$ K(x_i, x_j) = \exp\left\{ -(x_i - x_j)^\top r^{-2} \Sigma^{-1} (x_i - x_j) / 2 \right\}, \qquad (8) $$
where $r$ is a coefficient and $\Sigma$ is a variance-covariance matrix. The coefficient $r$ is related to the complexity of the classifying functions: the higher $r$ is, the lower the complexity. If the kernel functions allow for sufficiently rich feature spaces, the performances of SVMs are comparable in terms of out-of-sample forecasting accuracy [18].
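A possible implementation of the kernel in (8) is sketched below; it is our own illustration, in which estimating $\Sigma$ as the sample covariance of the data, the value of $r$, and the helper function names are assumptions rather than the original study's code:

```python
import numpy as np

def anisotropic_gaussian_kernel(xi, xj, Sigma_inv, r):
    """Kernel (8): exp{ -(xi - xj)^T r^-2 Sigma^-1 (xi - xj) / 2 }."""
    d = xi - xj
    return np.exp(-d @ Sigma_inv @ d / (2.0 * r**2))

def kernel_matrix(X, r):
    """Gram matrix of kernel (8); Sigma is taken as the sample covariance of X."""
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = anisotropic_gaussian_kernel(X[i], X[j], Sigma_inv, r)
    return K

# Toy usage: a larger r gives a smoother, lower-complexity classifying function
X = np.random.randn(6, 2)
print(kernel_matrix(X, r=2.0))
```

Such a Gram matrix can be passed directly to a dual solver in place of the linear kernel of (6).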