Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages



Provided by Cognitive Sciences ePrint Archive

IJCSI International Journal of Computer Science Issues, Vol. 4, No. 1, 2009

ISSN (Online): 1694-0784

ISSN (Print): 1694-0814




Daniela XHEMALI¹, Christopher J. HINDE² and Roger G. STONE³

1 Computer Science, Loughborough University
Loughborough, Leicestershire, LE11 3TU, UK

[email protected]

2 Computer Science, Loughborough University
Loughborough, Leicestershire, LE11 3TU, UK
[email protected]

3 Computer Science, Loughborough University
Loughborough, Leicestershire, LE11 3TU, UK
[email protected]

Abstract

Web classification has been attempted through many different
technologies. In this study we concentrate on the comparison of
Neural Networks (NN), Naïve Bayes (NB) and Decision Tree
(DT) classifiers for the automatic analysis and classification of
attribute data from training course web pages. We introduce an
enhanced NB classifier and run the same data sample through the
DT and NN classifiers to determine the success rate of our
classifier in the training courses domain. This research shows
that our enhanced NB classifier not only outperforms the
traditional NB classifier, but also performs as well as, if not
better than, some more popular rival techniques. This paper
also shows that, overall, our NB classifier is the best choice for
the training courses domain, achieving an impressive F-Measure
value of over 97% despite being trained with fewer samples
than any of the classification systems we have encountered.

Keywords: Web classification, Naïve Bayesian Classifier,
Decision Tree Classifier, Neural Network Classifier, Supervised
learning.

1. Introduction

Managing the vast amount of online information and
classifying it into what could be relevant to our needs is an
important step towards being able to use this information.
Thus, it comes as no surprise that Web Classification is
popular not only in academia, where it serves the need for
continuous knowledge growth, but also in industry, which
needs quick, efficient solutions for gathering and analysing
information and for keeping business-critical information
up to date.

This research is part of a larger research project in
collaboration with an independent brokerage organisation,
Apricot Training Management (ATM), which helps other
organisations to identify and analyse their training needs
and recommends suitable courses for their employees.
Currently, the latest prospectuses from different training
providers are ordered, catalogued, shelved and the course
information found is manually entered into the company’s
database. This is a time-consuming, labour-intensive
process that does not always guarantee up-to-date results,
because some course information, such as dates and prices,
has a limited life expectancy, and because up-to-date,
accurate information is not always available on websites
and in printed literature. The overall project therefore aims
to automate the process of retrieving, extracting and storing
course information in the database, so that it is always kept
up to date.

The research presented in this paper is related to the
information retrieval side of the project, in particular to the
automatic analysis and filtering of the retrieved web pages
according to their relevance. This classification process is
vital to the efficiency of the overall system, as only
relevant pages will then be considered by the extraction
process, thus drastically reducing processing time and
increasing accuracy.
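
As a rough illustration of the filtering step described above, the sketch below trains a small multinomial Naïve Bayes text classifier and passes only the pages it labels relevant on to a later stage. It is a minimal sketch under stated assumptions, not the classifier developed in this paper: the class name `NaiveBayesFilter`, the helper `filter_relevant`, the label strings, the Laplace smoothing parameter `alpha` and the toy training texts are all hypothetical, and the code looks only at whitespace-separated words.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesFilter:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha            # smoothing constant (assumption)
        self.word_counts = {}         # label -> Counter of word frequencies
        self.doc_counts = Counter()   # label -> number of training pages
        self.vocab = set()            # all words seen during training

    def fit(self, pages, labels):
        for text, label in zip(pages, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_score(self, text, label):
        """Return log P(label) + sum of log P(word | label) over known words."""
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denominator = total_words + self.alpha * len(self.vocab)
        for token in tokenize(text):
            if token not in self.vocab:
                continue  # words never seen in training carry no evidence
            score += math.log((counts[token] + self.alpha) / denominator)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


def filter_relevant(classifier, pages, relevant_label="relevant"):
    """Keep only the pages the classifier labels as relevant."""
    return [page for page in pages if classifier.predict(page) == relevant_label]


# Toy training data (hypothetical, not taken from the paper's corpus).
train_pages = [
    "project management training course dates prices book now",
    "leadership course two day workshop fee venue",
    "company news press release annual report",
    "contact us privacy policy cookies",
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

clf = NaiveBayesFilter().fit(train_pages, train_labels)
candidates = ["management course dates and prices", "press release annual report"]
kept = filter_relevant(clf, candidates)
print(kept)
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing constant keeps a word unseen in one category from forcing that category's probability to zero.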

The underlying technique used for our classifier is based
on the NB algorithm, due to the independence noticed in
the data corpus analysed. The traditional technique is
enhanced, however, to analyse not only the visible textual
content of web pages, but also important web structures
such as META data, TITLE and LINK information.
Additionally, a ‘believed probability’ of features in each
category is calculated to handle situations when there is
little evidence about the data, particularly in the early

