Examining Variations of Prominent Features in Genre Classification



Yunhyong Kim and Seamus Ross
Digital Curation Centre
&
Humanities Advanced Technology and Information Institute
University of Glasgow, Glasgow, UK

{y.kim, s.ross}@hatii.arts.gla.ac.uk

Abstract

This paper investigates the correlation between
features of three types (visual, stylistic and topical types)
and genre classes. The majority of previous studies in
automated genre classification have created models based
on an amalgamated representation of a document using a
combination of features. In these models, the inseparable
roles of different features make it difficult to determine a
means of improving the classifier when it exhibits poor
performance in detecting selected genres. In this paper
we use classifiers independently modeled on three groups
of features to examine six genre classes and show that the
strongest features for making one classification are not
necessarily the best features for carrying out another
classification.

1. Introduction

The research described in this paper examines
genre classes of text documents and the role of different
types of features in distinguishing these classes
automatically. Automated genre classification (e.g.
classification into scientific research articles, news reports,
or emails), which identifies the function and structure of
the document, supports metadata extraction ([12]) and
other information extraction by performing a first-level
classification of documents into groups of similar
structure, facilitating focused search of information on
specific document types, and supporting the integration
of techniques developed to work within selected genres.

The features which characterise a text often fall
into well-defined groups. For example, some features
capture the position of text blocks (visual layout), some
describe indicative vocabulary (significant terms) and
others attempt to identify the pragmatics of selected
terms or functional category (style). In previous studies
of automated genre classification (e.g. [4], [5], [9], [11],
[19], [20]) these features have been combined to produce
a single set of features to represent the documents which
are to be classified. This approach optimises the overall
performance of the classifier on the detection of the
predefined classes but makes it difficult to devise a means of
improving the classifier when it displays poor
performance in detecting selected genres. It also takes for
granted that the predefined classes describe a comparable
schema of a single classification task.
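
As a contrast to the combined representation discussed above, the
following is a minimal sketch of modelling each feature group
independently and inspecting the result genre by genre. It is not the
classifier used in this paper: the nearest-centroid rule, the names
TRAIN, TEST, centroids and predict, and all feature values are
assumptions made up for illustration only.

```python
from math import dist

GROUPS = ("visual", "style", "topic")

# Hypothetical training documents: one tiny vector per feature group.
# All numbers are invented for illustration, not taken from any corpus.
TRAIN = [
    ("Thesis",  {"visual": [0.90], "style": [0.50], "topic": [0.50]}),
    ("Thesis",  {"visual": [0.80], "style": [0.60], "topic": [0.40]}),
    ("Paper",   {"visual": [0.20], "style": [0.55], "topic": [0.45]}),
    ("Paper",   {"visual": [0.10], "style": [0.65], "topic": [0.55]}),
    ("Minutes", {"visual": [0.50], "style": [0.10], "topic": [0.90]}),
    ("Minutes", {"visual": [0.60], "style": [0.20], "topic": [0.80]}),
]

# Hypothetical held-out documents, one per genre.
TEST = [
    ("Thesis",  {"visual": [0.85], "style": [0.62], "topic": [0.52]}),
    ("Paper",   {"visual": [0.15], "style": [0.52], "topic": [0.43]}),
    ("Minutes", {"visual": [0.30], "style": [0.15], "topic": [0.85]}),
]


def centroids(docs, group):
    """Average the vectors of each genre within one feature group."""
    by_genre = {}
    for genre, features in docs:
        by_genre.setdefault(genre, []).append(features[group])
    return {
        genre: [sum(column) / len(vectors) for column in zip(*vectors)]
        for genre, vectors in by_genre.items()
    }


def predict(model, vector):
    """Return the genre whose centroid is nearest to the vector."""
    return min(model, key=lambda genre: dist(model[genre], vector))


# One independent model per feature group, instead of one combined model.
models = {group: centroids(TRAIN, group) for group in GROUPS}

# correct[group][genre] records whether that group's model labels the
# held-out document of that genre correctly.
correct = {
    group: {
        genre: predict(models[group], features[group]) == genre
        for genre, features in TEST
    }
    for group in GROUPS
}

for group in GROUPS:
    print(group, correct[group])
```

On this invented data the visual model separates Thesis from Paper but
misses Minutes, while the style and topic models do the reverse. Keeping
the groups separate makes such per-genre differences visible, whereas a
single combined feature set would report only one overall result.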

In this paper we give evidence that genre
classification, as described in previous studies, may
actually be a combination of several independent tasks.
For example, the distinction between a Thesis and a
Scientific Paper is largely structural, while Meeting
Minutes and Business Reports are mostly distinguished by
topic and style. On the other hand, the distinction between
a Table of Financial Statistics and a Financial Report lies
mainly in the visual representation and style. Using the
same features to model these different types of
classification concurrently would be equivalent to estimating a
single distribution for items which belong to distinct
populations. In the previous literature (e.g. Table
5 in [10], Table 3 in [11]), classification errors range
from seventeen percent to seventy-six percent
([10]), and from six percent to eighty percent ([11]). Such
large differences in error rate may indicate that a
re-evaluation of the task, to determine whether it is actually
a combination of many tasks disguised by the single term
genre classification, would be productive.
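
The remark about estimating a single distribution for distinct
populations can be illustrated with a small numerical sketch. The two
value lists below are hypothetical and are not drawn from any genre
corpus; the point is only that a single Gaussian fitted to the pooled
data is centred where no observation lies and is far wider than either
population.

```python
from statistics import mean, pstdev

# Hypothetical one-dimensional feature values for two distinct populations.
population_a = [0.9, 1.0, 1.1, 1.0, 0.8, 1.2]
population_b = [8.9, 9.0, 9.1, 9.0, 8.8, 9.2]
pooled = population_a + population_b


def summarise(values):
    """Return the mean and standard deviation of a Gaussian fitted to values."""
    return mean(values), pstdev(values)


mean_a, sd_a = summarise(population_a)
mean_b, sd_b = summarise(population_b)
mean_pooled, sd_pooled = summarise(pooled)

# Count the observations lying within one unit of the pooled mean.
near_pooled_mean = sum(1 for v in pooled if abs(v - mean_pooled) <= 1.0)

print("population A:", mean_a, sd_a)
print("population B:", mean_b, sd_b)
print("pooled:", mean_pooled, sd_pooled)
print("observations near the pooled mean:", near_pooled_mean)
```

Fitting each population separately gives two narrow distributions, while
the pooled fit describes neither of them well.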

Another prevailing notion in earlier analyses is
that genre classification is orthogonal to topic or subject
classification. This notion defines genre classification as
a task independent from subject classification. While
there may be a conceptual level at which this is true,
within the probabilistic framework on which language
processing is highly reliant, there is reason to believe that
this is not generally the case. For example, consider the
topic of cohomology, a well-known subject area in higher
mathematics; this topic would not be expected to appear
as frequently in the genre class Reportage as it would in
the genre class Research Article. This suggests that, at


