Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages



Provided by Cognitive Sciences ePrint Archive

IJCSI International Journal of Computer Science Issues, Vol. 4, No. 1, 2009

ISSN (Online): 1694-0784

ISSN (Print): 1694-0814

Daniela XHEMALI¹, Christopher J. HINDE² and Roger G. STONE³

1 Computer Science, Loughborough University
Loughborough, Leicestershire, LE11 3TU, UK

2 Computer Science, Loughborough University
Loughborough, Leicestershire, LE11 3TU, UK

3 Computer Science, Loughborough University
Loughborough, Leicestershire, LE11 3TU, UK

Abstract

Web classification has been attempted through many different
technologies. In this study we concentrate on the comparison of
Neural Networks (NN), Naïve Bayes (NB) and Decision Tree
(DT) classifiers for the automatic analysis and classification of
attribute data from training course web pages. We introduce an
enhanced NB classifier and run the same data sample through the
DT and NN classifiers to determine the success rate of our
classifier in the training courses domain. This research shows
that our enhanced NB classifier not only outperforms the
traditional NB classifier, but also performs as well as, if not
better than, some more popular rival techniques. This paper
also shows that, overall, our NB classifier is the best choice for
the training courses domain, achieving an impressive F-Measure
value of over 97%, despite it being trained with fewer samples
than any of the classification systems we have encountered.

Keywords: Web classification, Naïve Bayesian Classifier,
Decision Tree Classifier, Neural Network Classifier, Supervised
learning.

1. Introduction

Managing the vast amount of online information and
classifying it into what could be relevant to our needs is an
important step towards being able to use this information.
Thus, it comes as no surprise that Web Classification is
popular not only in academia, where it serves the need for
continuous knowledge growth, but also in industry, which
needs quick, efficient solutions for gathering and analysing
information and for keeping business-critical information
up to date.

This research is part of a larger research project in
collaboration with an independent brokerage organisation,
Apricot Training Management (ATM), which helps other
organisations to identify and analyse their training needs
and recommends suitable courses for their employees.
Currently, the latest prospectuses from different training
providers are ordered, catalogued, shelved and the course
information found is manually entered into the company’s
database. This is a time-consuming, labour-intensive
process, which does not always guarantee up-to-date
results, because some course information, such as dates
and prices, has a limited life expectancy and because
up-to-date, accurate information is not always available on
websites and in printed literature. The overall project
therefore aims to automate the process of retrieving,
extracting and storing course information in the database,
ensuring that it is always kept up to date.

The research presented in this paper is related to the
information retrieval side of the project, in particular to the
automatic analysis and filtering of the retrieved web pages
according to their relevance. This classification process is
vital to the efficiency of the overall system, as only
relevant pages will then be considered by the extraction
process, thus drastically reducing processing time and
increasing accuracy.

The underlying technique used for our classifier is based
on the NB algorithm, due to the independence noticed in
the data corpus analysed. The traditional technique is
enhanced, however, to analyse not only the visible textual
content of web pages, but also important web structures
such as META data, TITLE and LINK information.
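The idea of feeding both visible text and page structures to an NB classifier can be illustrated with a minimal sketch. It assumes Python and its standard html.parser module. The names PageFeatures, extract_tokens and NaiveBayes are hypothetical and do not come from this paper. The classifier is a textbook multinomial NB with add-one smoothing, used only as a stand-in, and is not the enhanced classifier described here.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    """Collects word tokens from visible text plus TITLE, META and link data."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._context = "text"  # where the current character data belongs

    def _add(self, prefix, text):
        # Prefix each token with its source so that, for example,
        # "title:java" and "text:java" are treated as distinct features.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.tokens.append(f"{prefix}:{word}")

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("keywords", "description"):
            self._add("meta", attrs.get("content") or "")
        elif tag == "a":
            self._add("link", attrs.get("href") or "")
            self._context = "link"
        elif tag == "title":
            self._context = "title"
        elif tag in ("script", "style"):
            self._context = "skip"  # never part of the visible text

    def handle_endtag(self, tag):
        if tag in ("a", "title", "script", "style"):
            self._context = "text"

    def handle_data(self, data):
        if self._context != "skip":
            self._add(self._context, data)


def extract_tokens(html):
    """Returns the list of source-prefixed tokens found in an HTML string."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return parser.tokens


class NaiveBayes:
    """Textbook multinomial NB with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # pages seen per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        self.vocab = set()

    def train(self, tokens, label):
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In use, each labelled page would be passed through extract_tokens and then to train, after which predict returns the most probable category for the tokens of an unseen page. Working in log space avoids numerical underflow when many small probabilities are multiplied.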
Additionally, a ‘believed probability’ of features in each
category is calculated to handle situations when there is
little evidence about the data, particularly in the early

