IJCSI International Journal of Computer Science Issues, Vol. 4, No. 1, 2009
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814
Naïve Bayes vs. Decision Trees vs. Neural Networks in the
Classification of Training Web Pages
Daniela XHEMALI1, Christopher J. HINDE2 and Roger G. STONE3
1 Computer Science, Loughborough University
Loughborough, Leicestershire, LE11 3TU, UK
2 Computer Science, Loughborough University
Loughborough, Leicestershire, LE11 3TU, UK
[email protected]
3 Computer Science, Loughborough University
Loughborough, Leicestershire, LE11 3TU, UK
[email protected]
Abstract
Web classification has been attempted through many different
technologies. In this study we concentrate on the comparison of
Neural Networks (NN), Naïve Bayes (NB) and Decision Tree
(DT) classifiers for the automatic analysis and classification of
attribute data from training course web pages. We introduce an
enhanced NB classifier and run the same data sample through the
DT and NN classifiers to determine the success rate of our
classifier in the training courses domain. This research shows
that our enhanced NB classifier not only outperforms the
traditional NB classifier, but also performs as well as, if
not better than, some more popular rival techniques. This paper
also shows that, overall, our NB classifier is the best choice for
the training courses domain, achieving an impressive F-Measure
value of over 97%, despite it being trained with fewer samples
than any of the classification systems we have encountered.
Keywords: Web classification, Naïve Bayesian Classifier,
Decision Tree Classifier, Neural Network Classifier, Supervised
learning.
1. Introduction
Managing the vast amount of online information and
classifying it into what could be relevant to our needs is an
important step towards being able to use this information.
Thus, it comes as no surprise that the popularity of Web
Classification applies not only to the academic needs for
continuous knowledge growth, but also to the needs of
industry for quick, efficient solutions to information
gathering and analysis in maintaining up-to-date
information that is critical to business success.
This research is part of a larger research project in
collaboration with an independent brokerage organisation,
Apricot Training Management (ATM), which helps other
organisations to identify and analyse their training needs
and recommend suitable courses for their employees.
Currently, the latest prospectuses from different training
providers are ordered, catalogued, shelved and the course
information found is manually entered into the company’s
database. This is a time-consuming, labour-intensive
process, which does not always guarantee up-to-date
results, due to the limited life expectancy of some course
information such as dates and prices and other limitations
in the availability of up-to-date, accurate information on
websites and printed literature. The overall project is
therefore to automate the process of retrieving, extracting
and storing course information in the database,
ensuring that it is always kept up to date.
The research presented in this paper is related to the
information retrieval side of the project, in particular to the
automatic analysis and filtering of the retrieved web pages
according to their relevance. This classification process is
vital to the efficiency of the overall system, as only
relevant pages will then be considered by the extraction
process, thus drastically reducing processing time and
increasing accuracy.
The underlying technique used for our classifier is based
on the NB algorithm, because of the independence observed in
the data corpus analysed. The traditional technique is
enhanced, however, to analyse not only the visible textual
content of web pages, but also important web structures
such as META data, TITLE and LINK information.
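As an illustration only, the following minimal sketch shows one way the feature sources named above could be separated before classification. It is not the implementation used in this study: the names tokenise, PageFeatureParser and extract_features are hypothetical, LINK information is assumed here to mean the anchor text of hyperlinks, and META data is assumed to mean the keywords and description tags.

```python
import re
from html.parser import HTMLParser


def tokenise(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PageFeatureParser(HTMLParser):
    """Hypothetical helper: groups tokens by where they occur in the page."""

    def __init__(self):
        super().__init__()
        self.features = {"title": [], "meta": [], "link": [], "text": []}
        self._open = []  # stack of open tags that change how data is filed

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            # Assumption: only keywords/description META tags are of interest.
            if (attr.get("name") or "").lower() in ("keywords", "description"):
                self.features["meta"] += tokenise(attr.get("content") or "")
        elif tag in ("title", "a", "script", "style"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        current = self._open[-1] if self._open else None
        if current in ("script", "style"):
            return  # not visible textual content
        field = {"title": "title", "a": "link"}.get(current, "text")
        self.features[field] += tokenise(data)


def extract_features(html):
    """Return a dict mapping each feature source to its list of tokens."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.features


page = (
    '<html><head><title>Java Training Course</title>'
    '<meta name="keywords" content="training, course, java">'
    '<style>p{}</style></head>'
    '<body><p>Book a two day course.</p>'
    '<a href="/dates">Course dates</a></body></html>'
)
features = extract_features(page)
```

Keeping the token lists separate per source leaves it open to a classifier to weight, for example, TITLE tokens differently from body text; whether and how to do so is a design choice outside this sketch.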
Additionally, a ‘believed probability’ of features in each
category is calculated to handle situations when there is
little evidence about the data, particularly in the early