This dataset consists of 570 PDF documents gathered
from the Internet using random search words. For the
retrieval of each item, the algorithm selects a random
word from SCOWL (Spell Checker Oriented Word List -
available from sourceforge.net), retrieves a list of PDFs
containing the search word, and then saves a random
document from the first hundred documents returned. The
data gathered by this method was labelled by one of the
authors of this paper.
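For illustration, the retrieval loop can be sketched as follows; the web search back end is abstracted behind a hypothetical search_pdfs() helper, since the search service actually used is not specified here.

```python
# A minimal sketch of the RAGGED retrieval procedure described above.
# The search back end is a placeholder, not the original interface.
import random
import urllib.request

def load_scowl_words(path="scowl-words.txt"):
    """Load a SCOWL word list from a local text file (one word per line)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def search_pdfs(word, limit=100):
    """Placeholder: return up to `limit` URLs of PDFs containing `word`."""
    raise NotImplementedError("plug in a web search API here")

def retrieve_one_document(words, out_path):
    word = random.choice(words)          # random SCOWL search word
    hits = search_pdfs(word, limit=100)  # first hundred PDF results
    url = random.choice(hits)            # pick one result at random
    urllib.request.urlretrieve(url, out_path)
    return word, url
```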
KRYS I
This corpus consists of documents belonging to one of the
seventy genres described in Table 1 of [14]. The corpus
was constructed through a document retrieval exercise
where university students were assigned genres from
Table 1 of [14], and, for each genre, asked to retrieve from
the Internet one hundred examples of that genre
represented in PDF and written in English. They were not
given any descriptions of the genres apart from the genre
label. Instead, they were asked to describe their reasons for including each particular example in the set. For some
genres, the students were unable to identify and acquire
one hundred examples. The resulting corpus now includes
6478 items.
3.2 Experimental data
The experiments analysed in Section 6 have been
conducted on two datasets, a subset of RAGGED (Dataset
I) and a subset of KRYS I (Dataset II), consisting of all the documents in each corpus initially labelled with one of six genres, namely Academic Monograph (AM), Business Report (BR), Book of Fiction (BF), Minutes (M), Periodicals (P), and Thesis (T). The experimental
Dataset I comprises 16 examples of AM, 16 examples of
BR, 15 examples of BF, 19 examples of M, 19 examples of
P, and 18 examples of T, while Dataset II comprises 99
examples of AM, 29 examples of BF, 100 examples of
BR, 99 examples of M, 67 examples of P, 100 examples of
T. The low proportion of Book of Fiction and Periodicals
in Dataset II is due to the difficulty in finding publicly
available examples of that genre.
4. Classifiers
Eight classifiers are examined in this paper, built from combinations of three feature types and three statistical methods. The three statistical methods employed are Naive Bayes (NB) [16], Support Vector Machine (SVM) [26] and Random Forest (RF) [6]. We have called the three feature types image, style and Rainbow. The image and style features have been modelled with all three statistical methods, for comparison, using the Weka machine learning toolkit [21]. The features represented by Rainbow, on the other hand, have been modelled only with Naive Bayes and Support Vector Machine: the Rainbow features are native to the Rainbow module of the BOW toolkit [15], and Random Forest, which was developed at a later date, was unfortunately not built into the module. We will refer to the eight classifiers by naming the feature type followed by the abbreviation for the statistical method (e.g. image NB for the image-feature Naive Bayes classifier). The
parameters for feature selection and statistical methods
have been optimised over a finite set of combinations
tested for best overall accuracy on several samples taken
from RAGGED. The final feature selection method is
described below.
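Although the experiments were run with the Weka toolkit and the Rainbow module, the eight feature/method pairings can be sketched, for illustration, with scikit-learn estimators; the estimator classes and default parameters below are assumptions and do not reflect the optimised settings used in the experiments.

```python
# A minimal sketch of the eight feature/method combinations, assuming the
# image, style and Rainbow features are already available as numeric arrays.
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

methods = {"NB": GaussianNB, "SVM": SVC, "RF": RandomForestClassifier}
feature_types = ["image", "style", "rainbow"]

classifiers = {}
for feat in feature_types:
    for name, cls in methods.items():
        if feat == "rainbow" and name == "RF":
            continue  # Random Forest is not available for Rainbow features
        classifiers[f"{feat} {name}"] = cls()

# classifiers now holds the eight models, e.g. classifiers["image NB"]
```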
Image features: The first page of the document
was converted into a low resolution grey-scale image and
sectioned into an N x N grid. Each region on the grid was
examined for non-white pixels. All regions with non-
white pixels were assigned a value of 1 and all the other
regions were assigned a value of 0, to create a low
resolution bit map. Several grid sizes were tested on
samples taken from RAGGED, but we found N=62 to
produce the best results. This was also the coarsest level
of granularity at which human subjects were able to
distinguish particular documents as members of specific
genre classes.
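A minimal sketch of this feature extraction is given below, assuming the first page has already been rendered to a grey-scale image (e.g. with pdftoppm); the rendering resolution and the whiteness threshold are assumptions not specified above.

```python
# Sketch of the image features: an N x N binary bitmap of the first page,
# with N=62 as reported above.
import numpy as np
from PIL import Image

def image_features(page_image_path, n=62, white_threshold=250):
    """Return an n*n binary vector: 1 if a grid cell contains any
    non-white pixel, 0 otherwise."""
    img = Image.open(page_image_path).convert("L")   # grey-scale
    img = img.resize((n * 4, n * 4))                 # divide evenly into cells
    pixels = np.asarray(img)
    cell_h, cell_w = pixels.shape[0] // n, pixels.shape[1] // n
    bitmap = np.zeros((n, n), dtype=np.uint8)
    for i in range(n):
        for j in range(n):
            cell = pixels[i * cell_h:(i + 1) * cell_h,
                          j * cell_w:(j + 1) * cell_w]
            bitmap[i, j] = 1 if (cell < white_threshold).any() else 0
    return bitmap.flatten()
```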
Style features: From an independent dataset
consisting of documents retrieved from the Internet, the
union of all words found to be common amongst the files
in each genre class was compiled into a list. The dataset
used in this process consisted of 190 documents
belonging to nineteen genres inclusive of the six genres
being examined in this paper. The thirteen complementary
genre classes are Abstract, Magazine Article, Scientific Research Article, Forms, Technical Manual, Technical Report, Email, Memo, Advertisement, Exam Worksheet, Slides, Speech Transcript, and Poster. Each test document was then represented by a vector giving the frequency, in that document, of each word in the compiled list.
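A minimal sketch of this representation is given below, assuming the per-genre common-word sets derived from the independent 190-document dataset are already available; the tokenisation used here is an assumption.

```python
# Sketch of the style features: frequencies of the compiled word list
# within a single test document.
from collections import Counter
import re

def build_word_list(common_words_per_genre):
    """Union of the words found to be common in each genre class."""
    return sorted(set().union(*common_words_per_genre.values()))

def style_features(text, word_list):
    """Frequency of each word of the compiled list in one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[w] for w in word_list]
```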
Rainbow features: This is a text classifier included in the BOW toolkit developed by McCallum [15]. The toolkit indexes the alpha-numeric content of the text and analyses the significant terms to estimate the probability of each word given each class. We have adopted the default setting of using a stop-word list, so that the significant topical words of each document are captured. The Rainbow classifier is popular for subject classification.
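Since the Rainbow module itself is a command-line tool, the following scikit-learn pipeline is only an illustrative analogue of its bag-of-words Naive Bayes setup with stop-word removal, not the implementation used in the experiments.

```python
# Illustrative analogue of a Rainbow-style bag-of-words Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stop-word removal mirrors the default setting described above.
rainbow_like = make_pipeline(
    CountVectorizer(stop_words="english", token_pattern=r"[A-Za-z0-9]+"),
    MultinomialNB(),
)

# Usage (texts and labels assumed to be lists of document strings and genres):
# rainbow_like.fit(train_texts, train_labels)
# predictions = rainbow_like.predict(test_texts)
```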