Examining Variations of Prominent Features in Genre Classification



Figure 1. Examples of document image representation: Scientific Article (left) and Magazine (right).


Our interest in image features reflects the
recognition that documents of certain genres have more
white space on the first page (e.g. the title page of a book),
are ruled by formatting conventions (e.g. the first slide of a
conference presentation), or are made visually elaborate
to attract readership (e.g. the reversal of black and white
on a magazine cover). Another benefit of examining
documents using image processing methods is that the
process does not depend on extracting text, can be
language independent, and supports document analysis
even when the content of the document is only accessible
as an image. Examples of the image representation are
given in Figure 1.

The style features are intended to capture the
frequency of words common to all genres as well as words
which are prolific only within some genres. A typical
example of the weight of this feature is that forms or
slides are likely to contain fewer definite or indefinite
articles than flowing text.
Sample average frequencies of words commonly found in
three of the genres discussed in the current study are
presented in Table 1 (the average is taken over ten
documents). In the reported experiment we used words
which were found in 75% of the files in each genre.
We also tried a few other percentages, but this threshold
gave the best results. To compile the common
words, we also tried a focused method that draws words
only from the six genres under consideration, and even a
list of words from a range of genres which excludes the
genres of interest. The combined list was adopted in the end
because all three style-based classifiers consistently
achieved higher accuracy with this list than with the
other two lists.
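
The word selection described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the tokenizer, the function names, the reading of "found in 75% of the files" as "found in at least 75% of the files", and the choice to merge the per-genre sets by taking their union are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def common_words(docs, threshold=0.75):
    """Return the words that occur in at least `threshold` of the documents."""
    if not docs:
        return set()
    doc_freq = Counter()
    for text in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(tokenize(text)))
    return {w for w, n in doc_freq.items() if n / len(docs) >= threshold}


def union_word_list(docs_by_genre, threshold=0.75):
    """Merge the per-genre word sets (assumption: a simple union), sorted for a stable order."""
    words = set()
    for docs in docs_by_genre.values():
        words |= common_words(docs, threshold)
    return sorted(words)


def style_features(text, word_list):
    """Represent one document as the raw count of each word in `word_list`."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in word_list]
```

Given a dictionary mapping each genre name to a list of document texts, `union_word_list` yields the vocabulary and `style_features` turns a document into a count vector over it. Raising or lowering `threshold` corresponds to trying the other percentages mentioned above.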

Table 1. Average frequency of words per document
across three genres.

                         Genre
Word           Business Report    Thesis    Minutes
have                  47            109         0
with                  71            210        13
do                    11              0         0
case                   0             10         0
meeting                0              0         8
information           12              0         0

5. Experiments

5.1. Method

Eight classifiers (image NB, image SVM, image
RF, style NB, style SVM, style RF, Rainbow
NB, and Rainbow SVM) have been tested on Dataset I and II
for their performance in recognising six genre classes:
Academic Monograph, Book of Fiction,
Business Report, Minutes, Periodicals, and Thesis. The
performance is examined using 10-fold cross-validation
results.
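
For readers unfamiliar with the procedure, 10-fold cross-validation can be sketched as below. This is a generic illustration under stated assumptions, not the evaluation code used in the study: the fold assignment (shuffled, not stratified by genre), the pooled accuracy over all held-out documents, and the `fit` and `predict` callables standing in for any of the classifiers are all hypothetical choices.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists so that each item is held out exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Train on k-1 folds, test on the remaining fold, and pool the correct predictions."""
    correct = 0
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(X)
```

Here `fit(X_train, y_train)` returns a trained model and `predict(model, x)` returns a genre label for one feature vector; both are placeholders for whichever feature type and statistical method is being evaluated.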

The performances of the eight classifiers are
first evaluated to identify, for each feature type, the
statistical methods that generate the best overall
performances on Dataset I and II (Section 6.1). Then, on
each dataset, the best classifiers, one for each feature type,


