7. Conclusions
The results in this paper provide evidence that
genre classification is a multi-dimensional task, possibly
composed of several classification tasks in which the
relative strengths of the feature types vary as
distinguishing factors.
The research proposes expressing document
genre classes in context as an array of varying strengths
across several feature types. This will not only help us to
supplement deficiencies in current classification methods
by suggesting why they fail to detect selected genres,
but will also enable us to relate document classes from
different classification schemas via similar or dissimilar
distribution patterns.
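
The proposed representation can be illustrated with a minimal sketch, which is not the implementation used in this research. It assumes each genre class is summarised by one strength value per feature type, and that classes from two schemas are related by the cosine similarity of those strength arrays. The feature type names, the class names, the strength values, and the choice of cosine similarity are all hypothetical and serve only as illustration.

```python
import math

# Hypothetical feature types; the names are illustrative assumptions,
# not the feature types evaluated in the paper.
FEATURE_TYPES = ("visual", "stylistic", "lexical")

# Made-up strength profiles: one value per feature type, per genre class.
# Each schema stands for a different (hypothetical) classification schema.
SCHEMA_A = {
    "thesis": (0.2, 0.6, 0.9),
    "slides": (0.9, 0.3, 0.2),
}
SCHEMA_B = {
    "research report": (0.3, 0.5, 0.8),
    "poster": (0.8, 0.2, 0.3),
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two strength arrays."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile has no direction to compare
    return dot / (norm_a * norm_b)


def closest_class(profile, schema):
    """Return the class in `schema` whose profile is most similar to `profile`."""
    return max(schema, key=lambda name: cosine_similarity(profile, schema[name]))


def relate_schemas(source, target):
    """Map every class in `source` to its most similar class in `target`."""
    return {name: closest_class(profile, target) for name, profile in source.items()}


if __name__ == "__main__":
    for source_class, target_class in relate_schemas(SCHEMA_A, SCHEMA_B).items():
        print(f"{source_class} -> {target_class}")
```

Under these assumed values, classes with similar distribution patterns are paired across the two schemas, while a class whose best match still has low similarity would indicate a genre with no close counterpart in the other schema.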
8. Acknowledgments
Omitted in the draft. To be inserted later.