Examining Variations of Prominent Features in Genre Classification



Figure 1. Examples of document image representation: Scientific Article (left) and Magazine (right).


Our interest in image features reflects the
recognition that documents of certain genres have more
white space on the first page (e.g. the title page of a book),
are ruled by formatting conventions (e.g. the first slide of a
conference presentation), or are made visually elaborate
to attract readership (e.g. the reversal of black and white
on a magazine cover). A further benefit of examining
documents with image processing methods is that the
process does not depend on extracting text, can be
language independent, and supports document analysis
even when the content of the document is accessible only
as an image. Examples of the image representation are
given in Figure 1.

The style features are intended to capture the
frequency of words common to all genres as well as words
that are prolific only within some genres. The weight of
this feature is illustrated by the fact that forms or slides
are likely to contain fewer definite or indefinite articles
than flowing text. Sample average frequencies of words
commonly found in three of the genres discussed in the
current study are presented in Table 1 (each average is
taken over ten documents). In the reported experiment we
used words that were found in 75% of the files in each
genre. We also tried a few other percentages, but this
threshold gave the best results. To compile the common
words, we also tried a focused method that draws words
only from the six genres under consideration, and a list of
words drawn from a range of genres that excludes the
genres of interest. The combined list was adopted in the
end because all three style-based classifiers consistently
achieved higher accuracies with it than with the other two
lists.
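
The word selection described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes that "found in 75% of the files in each genre" means a word is kept if it occurs in at least 75% of the documents of at least one genre, and that the feature is the raw count of each selected word. The tokenizer and the function names (tokenize, common_words, frequency_vector) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def common_words(docs_by_genre, threshold=0.75):
    """Collect words occurring in at least `threshold` of the documents of some genre.

    Assumption: the per-genre lists are merged by taking their union.
    """
    selected = set()
    for docs in docs_by_genre.values():
        doc_freq = Counter()
        for text in docs:
            # Count each word once per document (document frequency).
            doc_freq.update(set(tokenize(text)))
        for word, count in doc_freq.items():
            if count >= threshold * len(docs):
                selected.add(word)
    return sorted(selected)


def frequency_vector(text, vocabulary):
    """Represent a document by the raw count of each vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Toy data for illustration only; these are not documents from the study.
toy_docs = {
    "Minutes": [
        "the meeting opened",
        "meeting closed with thanks",
        "the meeting notes",
        "agenda of the meeting",
    ],
    "Thesis": [
        "the case study",
        "we have a case",
        "in this case we have",
        "results",
    ],
}
vocabulary = common_words(toy_docs)
vector = frequency_vector("The meeting about the case", vocabulary)
```

Changing the threshold argument corresponds to trying the other percentages mentioned in the text.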

Table 1. Average frequency of words per document
across three genres.

Word          Business Report   Thesis   Minutes
have          47                109      0
with          71                210      13
do            11                0        0
case          0                 10       0
meeting       0                 0        8
information   12                0        0


5. Experiments

5.1. Method

Eight classifiers (image NB, image SVM, image
RF, style NB, style SVM, style RF, Rainbow NB, and
Rainbow SVM) have been tested on Dataset I and II
for their performance in recognising six genre classes:
Academic Monograph, Book of Fiction, Business Report,
Minutes, Periodicals, and Thesis. Performance is
measured using 10-fold cross-validation.
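
For readers unfamiliar with the protocol, 10-fold cross-validation can be sketched as below. This is an illustrative outline under stated assumptions, not the experimental code: the text does not say whether the folds were stratified or how fold scores were aggregated, so the sketch uses a plain shuffled split and the mean of per-fold accuracies. The names k_fold_indices and cross_validate are hypothetical, majority_baseline is only a placeholder standing in for the NB, SVM, and RF classifiers, and the dataset is assumed to hold at least k documents.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(fit_predict, features, labels, k=10, seed=0):
    """Return the mean held-out accuracy over k folds.

    `fit_predict(train_x, train_y, test_x)` must return one predicted
    label per test item. Assumes len(labels) >= k so no fold is empty.
    """
    accuracies = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        predictions = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in held_out],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, held_out))
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def majority_baseline(train_x, train_y, test_x):
    """Placeholder classifier: always predict the most frequent training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return [most_common] * len(test_x)


# Toy labels for illustration only; these are not results from the study.
toy_features = [[0]] * 20
toy_labels = ["Thesis"] * 15 + ["Minutes"] * 5
baseline_accuracy = cross_validate(majority_baseline, toy_features, toy_labels)
```

Each document is held out exactly once, so every prediction is made by a model that has not seen that document during training.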

The performances of the eight classifiers are
first evaluated to identify, for each feature type, the
statistical methods that generate the best overall
performances on Dataset I and II (Section 6.1). Then, on
each dataset, the best classifiers, one for each feature type,


