Examining Variations of Prominent Features in Genre Classification



Yunhyong Kim and Seamus Ross
Digital Curation Centre

&

Humanities Advanced Technology and Information Institute
University of Glasgow, Glasgow, UK

{y.kim, s.ross}@hatii.arts.gla.ac.uk

Abstract

This paper investigates the correlation between
features of three types (visual, stylistic and topical types)
and genre classes. The majority of previous studies in
automated genre classification have created models based
on an amalgamated representation of a document using a
combination of features. In these models, the inseparable
roles of different features make it difficult to determine a
means of improving the classifier when it exhibits poor
performance in detecting selected genres. In this paper
we use classifiers modeled independently on three groups
of features to examine six genre classes, and show that the
strongest features for making one classification are not
necessarily the best features for carrying out another
classification.

1. Introduction

The research described in this paper examines
genre classes of text documents and the role of different
types of features in distinguishing these classes
automatically. Automated genre classification (e.g.
classification into scientific research articles, news reports,
or emails), which identifies the function and structure of
the document, supports metadata extraction ([12]) and
other information extraction in three ways: it performs a
first-level classification of documents into groups of similar
structure, it facilitates focused search for information within
specific document types, and it supports the integration
of techniques developed to work within selected genres.

The features which characterise a text often fall
into well-defined groups. For example, some features
capture the position of text blocks (visual layout), some
describe indicative vocabulary (significant terms) and
others attempt to identify the pragmatics of selected
terms or functional category (style). In previous studies
of automated genre classification (e.g. [4], [5], [9], [11],
[19], [20]) these features have been combined to produce
a single set of features to represent the documents which
are to be classified. This approach optimises the overall
performance of the classifier on the detection of the
predefined classes but makes it difficult to devise a means of
improving the classifier when it displays poor
performance in detecting selected genres. It also takes for
granted that the predefined classes describe a comparable
schema of a single classification task.
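
To make the contrast with a single combined representation concrete, the following minimal sketch trains one classifier per feature group and reports per-genre recall for each group. The toy vectors, the nearest-centroid rule, the evaluation on the training documents and all names (DOCS, GROUPS, train, predict, per_genre_recall) are assumptions made for illustration; they are not the models or data used in this paper.

```python
from collections import defaultdict

GROUPS = ("visual", "style", "topic")

# Hypothetical toy corpus: (genre, {feature group: vector}). All values are invented.
DOCS = [
    ("Thesis", {"visual": [0.9, 0.1], "style": [0.4, 0.5], "topic": [0.5, 0.6]}),
    ("Thesis", {"visual": [0.8, 0.2], "style": [0.5, 0.3], "topic": [0.6, 0.4]}),
    ("Scientific Paper", {"visual": [0.1, 0.9], "style": [0.5, 0.4], "topic": [0.6, 0.5]}),
    ("Scientific Paper", {"visual": [0.2, 0.8], "style": [0.4, 0.3], "topic": [0.5, 0.4]}),
    ("Meeting Minutes", {"visual": [0.5, 0.6], "style": [0.9, 0.2], "topic": [0.9, 0.1]}),
    ("Meeting Minutes", {"visual": [0.6, 0.4], "style": [0.8, 0.1], "topic": [0.8, 0.2]}),
    ("Business Report", {"visual": [0.6, 0.5], "style": [0.1, 0.8], "topic": [0.1, 0.9]}),
    ("Business Report", {"visual": [0.5, 0.4], "style": [0.2, 0.9], "topic": [0.2, 0.8]}),
]


def train(docs, group):
    """Return {genre: centroid}, using only the vectors of one feature group."""
    by_genre = defaultdict(list)
    for genre, features in docs:
        by_genre[genre].append(features[group])
    return {
        genre: [sum(column) / len(vectors) for column in zip(*vectors)]
        for genre, vectors in by_genre.items()
    }


def predict(model, vector):
    """Assign the genre whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(model, key=lambda genre: sq_dist(model[genre]))


def per_genre_recall(docs, group):
    """Recall per genre for one feature group (evaluated on the training documents)."""
    model = train(docs, group)
    hits, totals = defaultdict(int), defaultdict(int)
    for genre, features in docs:
        totals[genre] += 1
        hits[genre] += predict(model, features[group]) == genre
    return {genre: hits[genre] / totals[genre] for genre in totals}


# One independent model per feature group, so weak genres can be traced to a group.
RECALL = {group: per_genre_recall(DOCS, group) for group in GROUPS}
for group in GROUPS:
    print(group, RECALL[group])
```

On this invented data the visual model separates Thesis from Scientific Paper but confuses Meeting Minutes with Business Report, while the style and topic models behave the other way round. This is the kind of per-group diagnosis that is hard to read off a model built on one combined feature set.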

In this paper we give evidence that genre
classification, as described in previous studies, may
actually be a combination of several independent tasks.
For example, the distinction between a Thesis and
Scientific Paper is largely structural, while Meeting
Minutes and Business Reports are mostly distinguished by
topic and style. On the other hand, the distinction between
a Table of Financial Statistics and a Financial Report lies
mainly in the visual representation and style. Using the
same features to model these different types of
classification concurrently would be equivalent to estimating a
single distribution for items which belong to distinct
populations. In the previous literature (e.g. Table
5 in [10], Table 3 in [11]), classification errors range
anywhere from seventeen percent to seventy-six percent
([10]), and from six percent to eighty percent ([11]). Such
large differences in error rate suggest that a
re-evaluation of the task, to determine whether it is actually
a combination of many tasks disguised by the single term
genre classification, would be productive.
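
The remark about estimating a single distribution for distinct populations can be illustrated with a small numerical sketch. The two samples below are invented values of one hypothetical feature; nothing here is taken from the experiments in this paper. Fitting one mean and standard deviation to the pooled sample gives a summary that describes neither population well.

```python
from statistics import mean, pstdev

# Invented values of one hypothetical feature for two distinct populations.
pop_a = [0.8, 0.9, 1.0, 1.1, 1.2]
pop_b = [4.8, 4.9, 5.0, 5.1, 5.2]
pooled = pop_a + pop_b

# Separate estimates: each population is tightly concentrated around its own mean.
mean_a, sd_a = mean(pop_a), pstdev(pop_a)
mean_b, sd_b = mean(pop_b), pstdev(pop_b)

# Single estimate over the pooled items, ignoring that they come from two populations.
pooled_mean, pooled_sd = mean(pooled), pstdev(pooled)

# Distance from the pooled mean to the closest actual observation.
nearest_gap = min(abs(x - pooled_mean) for x in pooled)

print(f"A: mean={mean_a:.2f} sd={sd_a:.2f}")
print(f"B: mean={mean_b:.2f} sd={sd_b:.2f}")
print(f"pooled: mean={pooled_mean:.2f} sd={pooled_sd:.2f} nearest_gap={nearest_gap:.2f}")
```

The pooled mean falls in a region where no item lies, and the pooled spread is more than ten times the spread of either population.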

Another prevailing notion in earlier analyses is
that genre classification is orthogonal to topic or subject
classification. This notion defines genre classification as
a task independent from subject classification. While
there may be a conceptual level at which this is true,
within the probabilistic framework on which language
processing is highly reliant, there is reason to believe that
this is not generally the case. For example, consider the
topic of cohomology, a well-known subject area in higher
mathematics; this topic would not be expected to appear
as frequently in the genre class Reportage as it would in
the genre class Research Article. This suggests that, at


