Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages



IJCSI International Journal of Computer Science Issues, Vol. 4, No 1, 2009



stages of the classification process. Experiments have
shown that our classifier exceeds expectations, achieving
an impressive F-Measure value of over 97%.
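
For readers unfamiliar with the metric, the F-Measure combines precision
and recall into a single score. With equal weighting it is their
harmonic mean. The sketch below is illustrative only: the function name
f_measure and the counts passed to it are hypothetical and are not taken
from the experiments reported in this paper, and the paper's exact
weighting (beta) is assumed to be 1.

```python
def f_measure(true_positives: int, false_positives: int,
              false_negatives: int, beta: float = 1.0) -> float:
    """Compute the F-measure from raw counts.

    beta = 1.0 (an assumption here) weights precision and recall equally,
    giving the harmonic mean of the two.
    """
    if true_positives == 0:
        # Precision and recall are both zero (or undefined); report 0.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration, not results from the paper.
score = f_measure(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision=0.900 recall=0.750 F={score:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values,
a high F-Measure requires both precision and recall to be high.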

2. Related Work

Many ideas have emerged over the years on how to
achieve quality results from Web Classification systems.
The available approaches include Clustering, NB and
Bayesian Networks, NNs, DTs and Support Vector Machines
(SVMs). We decided to concentrate only on NN, DT and NB
classifiers, as they proved more closely applicable to our
project. Despite the benefits of other approaches, our
research is in collaboration with a small organisation, so
we had to consider the organisation’s hardware and
software limitations before deciding on a classification
technique. SVM and Clustering would be too expensive
and processor intensive for the organisation, and were
therefore considered inappropriate for this project. The
following subsections discuss the pros and cons of NB, DTs
and NNs, as well as related research in each field.

2.1 Naïve Bayes Models

NB models are popular in machine learning applications,
due to their simplicity in allowing each attribute to
contribute towards the final decision equally and
independently from the other attributes. This simplicity
equates to computational efficiency, which makes NB
techniques attractive and suitable for many domains.
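
The independent contribution of each attribute can be made concrete
with a minimal sketch. It assumes a textbook multinomial NB with
add-one (Laplace) smoothing, which is not necessarily the classifier
used in this paper. The function names train_nb and classify_nb and the
toy pages are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(pages, labels):
    """Collect class frequencies and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(pages, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify_nb(words, model):
    """Return the class with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_pages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_pages in class_counts.items():
        score = math.log(n_pages / total_pages)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Each word adds its own term, independently of the others.
            # Add-one smoothing avoids log(0) for unseen (word, class) pairs.
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: tokenised pages and their categories.
pages = [
    ["course", "training", "schedule"],
    ["training", "certificate", "course"],
    ["news", "sports", "weather"],
    ["weather", "forecast", "news"],
]
labels = ["training", "training", "other", "other"]
model = train_nb(pages, labels)
print(classify_nb(["course", "certificate"], model))
```

Training is a single counting pass and classification is a sum of
per-word log terms, which is where the computational efficiency comes
from. No term ever looks at two words jointly.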

However, the very property that makes them popular is
also the reason some researchers consider this approach
to be weak. The conditional independence assumption is
strong, and makes NB-based systems incapable of using two
or more pieces of evidence together. Used in appropriate
domains, however, they offer quick training, fast data
analysis and decision making, as well as straightforward
interpretation of test results. Some research ([13], [26])
tries to relax the conditional independence assumption by
introducing latent variables in tree-shaped or
hierarchical NB classifiers. However, a thorough analysis
of a large number of training web pages has shown us that
the features used in these pages can be examined
independently to compute the category for each page. Thus,
the domain for our research can easily be analysed using
NB classifiers. In order to increase the system’s accuracy,
however, the classifier has been enhanced as described in
section 3.
Enhancing the standard NB rule, or using it in
combination with other techniques, has also been
attempted by other researchers. Addin et al. [1] coupled
an NB classifier with K-Means clustering to simulate
damage detection in engineering materials. NBTree [24]
induced a hybrid of NB and DTs by using the Bayes rule
to construct the decision tree. Other research works ([5],
[23]) have modified their NB classifiers to learn from
positive and unlabeled examples. Their assumption is that
finding negative examples is very difficult for certain
domains, particularly in the medical industry. Finding
negative examples for the training courses domain,
however, is not at all difficult, so this is not an
issue for our research.

2.2 Decision Trees

Unlike NB classifiers, DT classifiers can cope with
combinations of terms and can produce impressive results
for some domains. However, training a DT classifier is
quite complex, and the number of nodes created can get out
of hand in some cases. According to [17], six Boolean
attributes already admit 18,446,744,073,709,551,616
(2^64) distinct Boolean functions, so the space of trees
that may have to be represented is enormous. Decision trees
may be computationally expensive for certain domains;
however, they make up for it by offering models that are
genuinely simple to interpret, and by helping to consider
the most important factors in a dataset first by placing
them at the top of the tree.
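
The 20-digit figure quoted above equals 2^(2^6). Six Boolean attributes
give 2^6 = 64 possible input combinations, and each combination can be
labelled either way, giving 2^64 distinct Boolean functions that a
decision tree might have to represent. The short sketch below checks
the arithmetic. The function name count_boolean_functions is
hypothetical.

```python
def count_boolean_functions(num_attributes: int) -> int:
    """Number of distinct Boolean functions over n Boolean attributes."""
    truth_table_rows = 2 ** num_attributes  # one row per input combination
    return 2 ** truth_table_rows            # each row maps to True or False


# Growth is doubly exponential in the number of attributes.
for n in range(1, 7):
    print(n, count_boolean_functions(n))
```

The last line printed, for n = 6, matches the figure in the text.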

The researchers in [7], [12] and [15] all used DTs to allow
both the structure and the content of each web page to
determine the category to which it belongs. All achieved
an accuracy of under 85%. This idea is
very similar to our work, as our classifier also analyses
both structure and content. WebClass [12] was designed
to search geographically distributed groups of people who
share common interests. WebClass modifies the standard
decision tree approach by associating the tree root node
with only the keywords found, depth-one nodes with
descriptions and depth-two nodes with the hyperlinks
found. The system, however, only achieved 73% accuracy.
The second version of WebClass ([2]) implemented
various classification models, such as Bayes networks,
DTs, K-Means clustering and SVMs, in order to compare
the findings of WebClassII. However, the findings showed
that for increasing feature set sizes, the overall recall
fell to just 39.75%.

2.3 Neural Networks

NNs are powerful techniques for representing complex
relationships between inputs and outputs. Based on the
neural structure of the brain ([17]), NNs are complicated
and can be enormous for certain domains, containing
a large number of nodes and synapses. Some research
has managed to convert NNs into sets of rules in order
to discover what the NN has learnt ([8], [21]); however,
many other works still refer to NNs as a ‘black box’
