Examining Variations of Prominent Features in Genre Classification
Yunhyong Kim and Seamus Ross
Digital Curation Centre
&
Humanities Advanced Technology and Information Institute
University of Glasgow, Glasgow, UK
{y.kim, s.ross}@hatii.arts.gla.ac.uk
Abstract
This paper investigates the correlation between
three types of features (visual, stylistic and topical)
and genre classes. The majority of previous studies in
automated genre classification have created models based
on an amalgamated representation of a document using a
combination of features. In these models, the inseparable
roles of different features make it difficult to determine a
means of improving the classifier when it exhibits poor
performance in detecting selected genres. In this paper
we use classifiers independently modeled on three groups
of features to examine six genre classes, and show that the
strongest features for making one classification are not
necessarily the best features for carrying out another
classification.
1. Introduction
The research described in this paper examines
genre classes of text documents and the role of different
types of features in distinguishing these classes
automatically. Automated genre classification (e.g.
classification into scientific research articles, news reports,
or emails) identifies the function and structure of
a document. It supports metadata extraction ([12]) and
other information extraction by performing a first-level
classification of documents into groups of similar
structure, which facilitates focused search for information in
specific document types and supports the integration
of techniques developed to work within selected genres.
The features which characterise a text often fall
into well-defined groups. For example, some features
capture the position of text blocks (visual layout), some
describe indicative vocabulary (significant terms) and
others attempt to identify the pragmatics of selected
terms or functional category (style). In previous studies
of automated genre classification (e.g. [4], [5], [9], [11],
[19], [20]) these features have been combined to produce
a single set of features to represent the documents which
are to be classified. This approach optimises the overall
performance of the classifier on the detection of the
predefined classes but makes it difficult to devise a means of
improving the classifier when it displays poor
performance in detecting selected genres. It also takes for
granted that the predefined classes describe a comparable
schema of a single classification task.
In this paper we give evidence that genre
classification, as described in previous studies, may
actually be a combination of several independent tasks.
For example, the distinction between a Thesis and a
Scientific Paper is largely structural, while Meeting
Minutes and Business Reports are mostly distinguished by
topic and style. On the other hand, the distinction between
a Table of Financial Statistics and a Financial Report lies
mainly in the visual representation and style. Using the
same features to model these different types of
classification concurrently would be equivalent to estimating a
single distribution for items which belong to distinct
populations. The classification errors reported in previous
literature (e.g. Table 5 in [10], Table 3 in [11]) range
from seventeen percent to seventy-six percent
([10]), and from six percent to eighty percent ([11]). Such
large differences in error rate might indicate that a
re-evaluation of the task, to determine whether it is actually
a combination of many tasks disguised by the single term
genre classification, would be productive.
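The remark above about estimating a single distribution for items drawn from distinct populations can be illustrated with a small numerical sketch. The feature values below are hypothetical and are not taken from the experiments in this paper; they only show how a single pooled estimate can misrepresent both populations at once.

```python
import statistics

# Hypothetical values of one feature observed in two distinct populations
# (illustrative numbers only, not data from this paper).
pop_a = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
pop_b = [4.9, 5.0, 5.1, 5.0, 4.95, 5.05]


def fit(values):
    """Estimate the mean and population standard deviation of a sample."""
    return statistics.fmean(values), statistics.pstdev(values)


def nearest_gap(values, mean):
    """Smallest distance between a fitted mean and any observed value."""
    return min(abs(v - mean) for v in values)


# Separate estimates, one per population.
mean_a, sd_a = fit(pop_a)
mean_b, sd_b = fit(pop_b)

# A single estimate over the pooled items.
pooled = pop_a + pop_b
pooled_mean, pooled_sd = fit(pooled)
pooled_gap = nearest_gap(pooled, pooled_mean)

print(f"population A: mean={mean_a:.2f}, sd={sd_a:.3f}")
print(f"population B: mean={mean_b:.2f}, sd={sd_b:.3f}")
print(f"pooled:       mean={pooled_mean:.2f}, sd={pooled_sd:.3f}")
print(f"closest observation to the pooled mean is {pooled_gap:.2f} away")
```

In this sketch the pooled mean falls in a region where no item was observed, and the pooled spread is far larger than the spread of either population. This is the kind of mismatch that may arise when one model is asked to cover several distinct classification tasks.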
Another prevailing notion in earlier analyses is
that genre classification is orthogonal to topic or subject
classification. This notion defines genre classification as
a task independent of subject classification. While
there may be a conceptual level at which this is true,
within the probabilistic framework on which language
processing is highly reliant, there is reason to believe that
this is not generally the case. For example, consider the
topic of cohomology, a well-known subject area in higher
mathematics; this topic would not be expected to appear
as frequently in the genre class Reportage as it would in
the genre class Research Article. This suggests that, at