least on a practical probabilistic level, genre often moves
in close proximity to subject.
The present paper reports tests on two corpora of
genre-labelled PDF documents conducted to examine the
correlation between genre classes and three feature types,
to demonstrate that the best feature types for detecting any
one genre class are not necessarily the best for detecting
other genre classes. The feature types we will examine
are visual layout features, language modeling features and
stylistic word frequency. Initially our corpus has been
confined to one document format to narrow down the
problem space. We have chosen PDF as this format
because a tool for this format is likely to have immediate
widespread application given its popularity across
library, archival, commercial and private sectors. The
methods described here, however, do not use features
dependent on elements available only in PDF documents.
The process depends on PDF only insofar as it
relies on PDF tools to convert the documents into
image and text.
It is not the intention of this paper to introduce a
classifier optimised to perform genre classification (in
contrast to [12]). Here we put forward evidence that
establishing a correlation between feature types and genre
classes may be a reasonable step forward in constructing a
robust genre classification system.
2. Defining genre
Genre is a highly mutable context-dependent
concept. Its mutability is apparent in its usage across the
literature: Biber ([4]) characterised document genres
using five dimensions (information, narration,
elaboration, persuasion, abstraction), while others ([10],
[5]) examined the categorisation of documents into
common classes such as FAQ, Job Description, Editorial
or Reportage. Genre classification has sometimes been
defined as the analysis of particular aspects (narratives,
fact versus opinion, intended level of audience, and
positivity or negativity of opinion) of text ([11], [9]), and has
even been used to describe the discrimination of selected journals
and brochures from one another using visual layout ([1]).
Others ([17], [2]) have clustered documents into similar
feature groups without delving into genre facets or
classes, and some have championed a multi-genre schema
for web page classification ([19], [20]). Santini has
reviewed different approaches to genre classification
([18]).
While the definition of genre may not be easily
pinned down, there is general agreement that genre is a
concept used to categorise documents by structure and
function. In fact, the structure of documents in a genre
evolves to meet the functional requirements for the genre's
survival in the environment for which it was created, much
as the structure of an organism evolves to optimise
its survival in the natural environment (cf. [13]).
The accepted layout, language, components and style of
the document change dynamically to maximise its
chances of fulfilling its role as
• a piece of communication reflecting the intention
of the creator,
• a source of information for distribution to a user
community,
• a part of a process such as publication,
recruitment, or event,
• a type of data structure for representing
information.
In this context, it seems intuitively clear that each selected
feature will depend on one of five aspects: visual
layout, style, topic, semantic patterns, and contextual
elements which reflect the process for which the
document was created and used (cf. [12]).
The objective of this paper is to study
these feature types in relation to genre classes to
determine their effectiveness in the detection of visual
genres (e.g. data structure type), stylistic genres (e.g.
prescribed procedural style) and topical genres (e.g.
business versus legal briefing paper) independently. To
this end, we first examine white space analysis, stylistic
term frequency and significant term analysis in relation to
genre classification. Subsequently we will enrich this
basic set to examine more sophisticated features. It seems
important to keep a check on the number of parameters in
the first analysis.
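As a rough illustration of two of these feature types, the following
Python sketch computes a stylistic term frequency vector from extracted
text and a simple white space profile from a page image. It is a minimal
sketch under stated assumptions and not the feature set used in the
experiments: the word list STYLE_WORDS, the encoding of a page as a grid
of 0 (white) and 1 (ink) pixels, the grid size, and both function names
are hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical list of style words; the word list actually used is not given here.
STYLE_WORDS = ["the", "of", "and", "we", "you", "shall", "must", "not"]


def stylistic_frequencies(text, words=STYLE_WORDS):
    """Return the relative frequency of each style word among all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in words]


def whitespace_profile(page, rows=2, cols=2):
    """Return the fraction of white pixels in each cell of a rows x cols grid.

    `page` is a 2D list of pixels, 0 for white and 1 for ink
    (an assumed encoding chosen for this example).
    """
    height, width = len(page), len(page[0])
    profile = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            cell = [page[y][x]
                    for y in range(top, bottom)
                    for x in range(left, right)]
            white = sum(1 for p in cell if p == 0)
            # An empty cell (page smaller than the grid) counts as fully white.
            profile.append(white / len(cell) if cell else 1.0)
    return profile
```

Both functions return fixed-length vectors, so each feature type can be
passed to a classifier on its own and its usefulness for a given genre
class assessed separately, while the number of parameters stays small.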
3. Data
A common problem in the study of automated
genre classification is the lack of established experimental
data. A limited classification of documents into genres is
available in previously constructed datasets, but none of
them span a large number of genres, nor do they employ a
consistent schema. To alleviate the paucity of data, we
have created two corpora which we describe in this
section.
3.1. Corpora
We have constructed two independent corpora
in our research:
RAGGED (RAndomly Generated GEnre Data)