Examining Variations of Prominent Features in Genre Classification



are compared in detail across the six genres (Sections 6.2
and 6.3).

5.2. Evaluation

The results, apart from those reported in Section
6.3, have been evaluated with three conventional metrics
for classification: accuracy, precision and recall. To make
precise what we mean by these terms, let $N$ be the total
number of documents in the test data, $N_C$ the number of
documents in the class $C$, $TP(C)$ the number of
documents correctly predicted to be a member of class $C$,
and $FP(C)$ the number of documents incorrectly predicted
as belonging to class $C$. Accuracy $A$ is defined to be

$$A = \frac{\sum_{C} TP(C)}{N},$$

precision $P(C)$ of class $C$ is defined to be

$$P(C) = \frac{TP(C)}{TP(C) + FP(C)},$$

and recall $R(C)$ of class $C$ is defined to be

$$R(C) = \frac{TP(C)}{N_C}.$$
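The three definitions can be illustrated with a minimal sketch. The function name `evaluate`, the class labels "A", "B" and "C", and the toy predictions are hypothetical and are not taken from the experiments; a class that is never predicted is given a precision of 0.0 here, which is a convention assumed for this sketch only.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) for parallel label lists.

    Precision and recall are dictionaries keyed by class label.
    """
    n = len(y_true)           # N: total number of test documents
    n_c = Counter(y_true)     # N_C: number of documents in each class
    tp = Counter()            # TP(C): correct predictions of class C
    fp = Counter()            # FP(C): incorrect predictions of class C
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[guess] += 1
        else:
            fp[guess] += 1

    accuracy = sum(tp.values()) / n
    classes = set(y_true) | set(y_pred)
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = {
        c: tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in classes
    }
    recall = {c: tp[c] / n_c[c] if n_c[c] else 0.0 for c in classes}
    return accuracy, precision, recall


# Toy data with hypothetical class labels.
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "A"]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```

In this toy example three of the five documents are classified correctly, so the accuracy is 0.6, while class "C" is never predicted correctly and has a recall of 0.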

Although some debate surrounds the suitability of
accuracy, precision and recall as measures for
information retrieval tasks, for classification tasks they
are still deemed to be reasonable indicators of classifier
performance.

6. Results

6.1. Overall accuracy

The overall accuracies of classifiers built on each
feature type across statistical methods are reported in Table
2 (best performances are indicated in bold-face).

The tests on the two datasets consistently
indicate Naive Bayes as the best statistical method for
image features. Although the overall accuracies of Naive
Bayes and Random Forest are comparable on the larger
dataset, averaging the performances on the two datasets
(with a heavier weight on the larger set) suggests Naive
Bayes as the better performer for image features. On both
datasets, Support Vector Machine and Random Forest both
outperform Naive Bayes for style features. Although
Support Vector Machine and Random Forest perform
comparably on the smaller Dataset I, we have chosen
Random Forest as the better choice for style, because the
difference is prominent on Dataset II. For Rainbow, we
have chosen Naive Bayes for comparison on Dataset I and
Support Vector Machine on Dataset II: on each dataset the
difference in performance between the two methods was
too large to single out one method as better overall for
Rainbow.
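The weighted averaging mentioned for image features can be sketched as follows. The text does not state which weights were used, so weighting each dataset by its number of items is an assumption made for this sketch, and the names `SIZES`, `IMAGE_ACCURACY` and `weighted_accuracy` are hypothetical. The accuracy values are copied from Table 2.

```python
# Dataset sizes and image-feature accuracies as reported in Table 2.
SIZES = {"Dataset I": 103, "Dataset II": 494}
IMAGE_ACCURACY = {
    "NB": {"Dataset I": 0.524, "Dataset II": 0.48},
    "SVM": {"Dataset I": 0.35, "Dataset II": 0.395},
    "RF": {"Dataset I": 0.417, "Dataset II": 0.48},
}


def weighted_accuracy(per_dataset, sizes):
    """Average the accuracies, weighting each dataset by its item count.

    Assumption: weights proportional to dataset size; the source only
    says that the larger set receives a heavier weight.
    """
    total = sum(sizes.values())
    return sum(per_dataset[name] * sizes[name] for name in sizes) / total


weighted = {
    method: weighted_accuracy(acc, SIZES)
    for method, acc in IMAGE_ACCURACY.items()
}
best = max(weighted, key=weighted.get)
print(weighted, best)
```

Under this assumed weighting, Naive Bayes obtains the highest weighted accuracy for image features, because it ties with Random Forest on Dataset II and is ahead on Dataset I.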

In passing, we observe that, based on the overall
accuracies of the classifiers on the two datasets, the
classifiers based on image features are the least affected
by training dataset size (average difference in accuracy
0.036) and the classifiers based on Rainbow are the most
affected by dataset size (average difference in accuracy
0.328). The results also indicate that Support Vector
Machine and Random Forest seem more affected by
dataset size than Naive Bayes.

6.2. Precision and recall

In this section we compare the precision and
recall across genres of the classifiers for each feature type
that were shown to have the best overall accuracies
in the previous section (on Dataset I, image NB, style RF
and Rainbow NB; on Dataset II, image NB, style RF and
Rainbow SVM). The figures in Tables 3 and 4 show the
precision and recall across the six genres of each classifier
tested on Dataset I and II. The genres are indicated in the
leftmost column of the tables, with the number of
documents in each class noted in parentheses. The
classifiers being tested are indicated in parentheses at the
top of each of the following columns.

Table 2. Overall accuracy of feature types across statistical methods

                Data & method
                Dataset I (103 items)       Dataset II (494 items)
Feature type    NB       SVM      RF        NB       SVM      RF
image           0.524    0.35     0.417     0.48     0.395    0.48
style           0.505    0.573    0.641     0.63     0.724    0.828
Rainbow         0.428    0.25     N/A       0.618    0.715    N/A


