Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages



IJCSI International Journal of Computer Science Issues, Vol. 4, No 1, 2009



in accuracy, in comparison with the original Naïve Bayes
algorithm.

The NB classifier was tested against 8725 sampling units
after being trained with only 711 units. The same sample
was also analysed by a DT and a NN classifier, and the
results from all systems were compared with one another.
Our experiments showed that, although some NN
classifiers can be very accurate in some domains, they
take the longest to train and have extensibility issues due
to their extremely large and complex nature. We therefore
concluded that NNs would be too expensive for ATM and
unsuitable for handling the potentially large number of
features created by the classification process.

On a more positive note, our experiments produced
encouraging findings for the application of the NB algorithm
in the training courses domain: the NB classifier
achieved impressive results, including the highest
Precision value (99.37%) and F-Measure (97.26%).
Although some of these results are close to those of
the DT classifier, the experiments show that Naïve
Bayes classifiers should not be considered inferior to
more complex techniques such as Decision Trees or
Neural Networks. They are fast, consistent, easy to
maintain and accurate in the classification of attribute data,
such as that found in the training courses domain. In one of
our previous papers [25], we expressed our concern that
many researchers go straight for the more complex
approaches without trying the simpler ones first. We
hope this paper will encourage researchers to exploit the
simpler techniques, as they can be, as this paper has shown,
more efficient and much less expensive.
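
For readers less familiar with the metrics quoted above, the
following is a minimal sketch of how Precision, Recall and
F-Measure relate to one another. It assumes that the F-Measure
is the balanced harmonic mean of Precision and Recall (F1).
The function names and the confusion counts are illustrative
only and are not taken from our experiments.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of units labelled positive that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of truly positive units that were labelled positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p: float, r: float) -> float:
    """Balanced F-Measure (F1): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def implied_recall(p: float, f: float) -> float:
    """Recall implied by a precision and a balanced F-Measure.

    Rearranging F = 2PR / (P + R) gives R = F * P / (2P - F).
    """
    denom = 2 * p - f
    if denom <= 0:
        raise ValueError("no valid recall for this precision/F-Measure pair")
    return f * p / denom


# Hypothetical confusion counts, chosen only to make the arithmetic easy.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)
r = recall(tp, fn)
f = f_measure(p, r)
print(f"precision={p:.4f} recall={r:.4f} f_measure={f:.4f}")
```

Under the same assumption, calling implied_recall(0.9937, 0.9726)
suggests a Recall of roughly 95% for the NB classifier. This is a
derived figure that depends on the F1 assumption, not a value
reported in this section.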

The system may be improved further by reducing the
number of features analysed. More research is needed
to establish a suitable cut-off point for the extracted
features. This may speed up the classification process and
potentially improve the classifier further. More
tests will also be carried out to confirm the NB classifier’s
success on a larger scale. In conclusion, this research has
shown that the NB approach, enhanced to perform even
with limited information and analysing both web
content and structural information, gives very promising
results in the training courses domain, outperforming
powerful and popular rivals such as decision trees and
neural networks.
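
One way to explore a cut-off point for the extracted features is
sketched below: features are ranked by the number of training
units that contain them, only the top k are kept, and a small
Naïve Bayes classifier is trained on what remains. This is a
minimal illustration under stated assumptions, not the enhanced
NB classifier used in our system. The document-frequency
ranking, the add-one smoothing, the function names and the toy
data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def top_k_features(docs, k):
    """Keep the k features found in the most training units.

    Ties are broken alphabetically so the result is deterministic.
    """
    df = Counter()
    for features, _label in docs:
        df.update(set(features))
    ranked = sorted(df, key=lambda f: (-df[f], f))
    return set(ranked[:k])


def train_nb(docs, vocab):
    """Count classes and in-vocabulary features for a multinomial NB."""
    class_counts = Counter()
    feat_counts = defaultdict(Counter)
    for features, label in docs:
        class_counts[label] += 1
        for f in features:
            if f in vocab:
                feat_counts[label][f] += 1
    return class_counts, feat_counts, vocab


def classify(model, features):
    """Return the most probable label, using add-one smoothing."""
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in features:
            if f in vocab:  # features beyond the cut-off are ignored
                score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training units (hypothetical): (extracted features, class label).
TRAIN = [
    (["course", "training", "book", "date"], "training"),
    (["course", "training", "price"], "training"),
    (["news", "sport", "date"], "other"),
    (["news", "weather", "price"], "other"),
]

vocab = top_k_features(TRAIN, k=5)
model = train_nb(TRAIN, vocab)
print(sorted(vocab))
print(classify(model, ["training", "course", "fees"]))
```

Repeating the training and evaluation for several values of k,
on held-out data, would show where accuracy starts to fall and
so indicate a reasonable cut-off.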

Acknowledgments

We would like to thank the whole team at ATM for the
support and help they have offered us since the first day of
the project. We also thank both ATM and the Centre
for Innovative and Collaborative Engineering (CICE) for
funding our work.

References

[1] Addin, O., Sapuan, S. M., Mahdi, E., & Othman, M. “A
Naive-Bayes classifier for damage detection in engineering
materials”, Materials and Design, 2007, pp. 2379-2386.

[2] Ceci, M., & Malerba, D. “Hierarchical Classification of
HTML Documents with WebClassII”, Lecture Notes in
Computer Science, 2003, pp. 57-72.

[3] Chau, M., & Chen, H. “A machine learning approach to
web page filtering using content and structure analysis”,
Decision Support Systems, Vol. 44, No. 2, 2007, pp. 482-494.

[4] Crestani, F. “An Adaptive Information Retrieval System
Based on Neural Networks”, in: International Workshop on
Artificial Neural Networks: New Trends in Neural
Computation, Vol. 686, 1993, pp. 732-737.

[5] Denis, F., Laurent, A., Gilleron, R., & Tommasi, M. “Text
classification and co-training from positive and unlabeled
examples”, in: ICML Workshop: The Continuum from
Labeled to Unlabeled Data, 2003, pp. 80-87.

[6] Enhong, C., Shangfei, W., Zhenya, Z. & Xufa, W.
“Document classification with CC4 neural network”, in:
Proceedings of ICONIP, Shanghai, China, 2001.

[7] Estruch, V., Ferri, C., Hernandez-Orallo, J., &
Ramirez-Quintana, M. J. “Web Categorisation Using
Distance-Based Decision Trees”, in: International Workshop on
Automated Specification and Verification of Web Site, 2006,
pp. 35-40.

[8] Fletcher, G.P. & Hinde, C.J. “Interpretation of Neural
Networks as Boolean Transfer Functions”, Knowledge-Based
Systems, Vol. 7, No. 3, 1994, pp. 207-214.

[9] Fletcher, G.P. & Hinde, C.J. “Using Neural Networks as a
Tool for Constructing Rule Based Systems”, Knowledge-Based
Systems, Vol. 8, No. 4, 1995, pp. 183-189.

[10] Fletcher, G.P. & Hinde, C.J. “Producing Evidence for the
Hypotheses of Large Neural Networks”, Neurocomputing,
Vol. 10, 1996, pp. 359-373.

[11] Hinde, C.J., Fletcher, G.P., West, A.A. & Williams, D.J.
“Neural Networks”, ICL Systems Journal, Vol. 11, No. 2,
1997, pp. 244-278.

[12] Hu, W., Chang, K. & Ritter, G. “WebClass: Web Document
Classification Using Modified Decision Trees”, in: 38th Annual
Southeast Regional Conference, 2000, pp. 262-263.

[13] Langseth, H. & Nielsen, T. “Classification using
Hierarchical Naïve Bayes models”, Machine Learning, Vol. 63,
No. 2, 2006, pp. 135-159.

[14] Liu, Z. & Zhang, Y. “A competitive neural network
approach to web-page categorization”, International Journal of
Uncertainty, Fuzziness & Knowledge Systems, Vol. 9, 2001,
pp. 731-741.

[15] Orallo, J. “Extending Decision Trees for Web
Categorisation”, in: 2nd Annual Conference of the ICT for
EU-India Cross Cultural Dissemination, 2005.

[16] Quinlan, J. R. “Improved use of continuous attributes in
C4.5”, Journal of Artificial Intelligence Research, Vol. 4,
1996, pp. 77-90.

[17] Russell, S. & Norvig, P. Artificial Intelligence: A Modern
Approach, London: Prentice Hall, 2003.
