Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages



The algorithm used in this research is based on the
mathematical manipulation of the probabilities of the top
100 most used keywords extracted from each web page.
However, the separate probabilities alone are not sufficient to classify a new web page. The classification of each page requires combining the probabilities of all the separate features {f1, f2, ..., fi, ..., fn} into one probability, as in Eq. (3):

\arg\max_{C_n} P(C_n \mid v) = \arg\max_{C_n} \frac{1}{z}\, P(C_n) \prod_{i=1}^{n} P(f_i \mid C_n)    (3)
where z is a scaling factor dependent on the feature variables, which helps to normalise the data. This scaling factor was added because, when experimenting with smaller datasets, we found that the feature probabilities became very small, driving the page probabilities close to zero and making the system unusable.
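As an illustration only, the combination in Eq. (3) could be implemented as in the sketch below; the accumulation in log space and the small floor for unseen feature-category pairs are our own assumptions for avoiding the underflow just described, not the authors' implementation.

```python
import math

def combine_probabilities(features, priors, cond_probs):
    """Combine per-feature probabilities into one score per category (Eq. 3).

    priors:     dict mapping category Cn -> P(Cn)
    cond_probs: dict mapping (feature, category) -> P(fi | Cn)

    Probabilities are accumulated in log space to avoid underflow, then
    normalised by z (the sum over all categories).
    """
    log_scores = {}
    for cat, prior in priors.items():
        score = math.log(prior)
        for f in features:
            # Small floor for unseen (feature, category) pairs; an assumption.
            score += math.log(cond_probs.get((f, cat), 1e-9))
        log_scores[cat] = score

    # Shift by the maximum before exponentiating, for numerical stability.
    m = max(log_scores.values())
    unnormalised = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {c: v / z for c, v in unnormalised.items()}
```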

Once the probabilities for each category have been calculated, the probability values are compared to each other. The category with the highest probability, provided it satisfies a predefined threshold value, is assigned to the web page being classified. All the features extracted from this page are also paired up with the resulting category, and the information in the database is updated to expand the system's knowledge.
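A hedged sketch of this decision-and-update step is shown below; the threshold value, the dictionary-based count store and the function names are illustrative assumptions, since the paper does not describe its database schema.

```python
THRESHOLD = 0.5  # illustrative value; the paper does not state the threshold it used

def assign_category(posteriors, threshold=THRESHOLD):
    """Return the highest-probability category if it satisfies the threshold."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= threshold else None

def update_knowledge(feature_counts, features, category):
    """Pair the page's features with the winning category so that future
    classifications benefit from the new evidence (hypothetical store:
    a dict mapping (feature, category) -> count)."""
    for f in features:
        feature_counts[(f, category)] = feature_counts.get((f, category), 0) + 1
```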

Our research adds an additional step to the traditional NB algorithm to handle situations where there is very little evidence about the data, in particular during the early stages of classification. This step calculates a Believed Probability (Believed Prob), which produces more gradual probability values for the data. An initial probability is assigned to features about which little information is available; in this research, this value is set equal to the probability of the page given no evidence, P(Cn). The calculation followed is shown below:

(4)

where: bpw is the believed probability weight (in this case bpw = 1, which means that the Believed Probability is weighted the same as one feature).

The above eliminates the assignment of extreme
probability values when little evidence exists; instead,
much more realistic probability values are achieved, which
has improved the overall accuracy of the classification
process as shown in the Results section.
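Since Eq. (4) is not reproduced above, the sketch below only illustrates one plausible, m-estimate-style reading of the Believed Probability, in which the prior P(Cn) is blended with the observed counts and weighted by bpw; treat the exact formula as an assumption rather than the authors' equation.

```python
def believed_prob(feature_cat_count, feature_count, prior, bpw=1.0):
    """Blend observed evidence for a feature with the prior P(Cn).

    feature_cat_count: occurrences of the feature in pages of category Cn
    feature_count:     occurrences of the feature across all categories
    prior:             P(Cn), the category probability given no evidence
    bpw:               believed probability weight; bpw = 1 weights the
                       prior the same as one observed feature

    m-estimate-style smoothing; an assumption standing in for Eq. (4).
    """
    return (feature_cat_count + bpw * prior) / (feature_count + bpw)
```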

4. Results

4.1 Performance Measures

The experiments in this research are evaluated using the standard Web classification metrics of accuracy, precision, recall and F-measure. These were calculated from the predictive classification table, known as the confusion matrix (Table 1).

Table 1: Confusion Matrix

                               PREDICTED
                        IRRELEVANT    RELEVANT
ACTUAL    IRRELEVANT        TN            FP
          RELEVANT          FN            TP
Considering Table 1:

TN (True Negative): number of correct predictions that an instance is irrelevant
FP (False Positive): number of incorrect predictions that an instance is relevant
FN (False Negative): number of incorrect predictions that an instance is irrelevant
TP (True Positive): number of correct predictions that an instance is relevant
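For concreteness, the four counts can be tallied from parallel lists of actual and predicted labels, as in this sketch (the label values are assumptions):

```python
def confusion_counts(actual, predicted, relevant="RELEVANT"):
    """Tally TN, FP, FN and TP from parallel lists of actual and predicted labels."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == relevant and p == relevant:
            tp += 1          # relevant page predicted relevant
        elif a == relevant:
            fn += 1          # relevant page predicted irrelevant
        elif p == relevant:
            fp += 1          # irrelevant page predicted relevant
        else:
            tn += 1          # irrelevant page predicted irrelevant
    return tn, fp, fn, tp
```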

Accuracy - The proportion of the total number of predictions that were correct:

Accuracy (%) = (TN + TP) / (TN + FN + FP + TP)    (5)

Precision - The proportion of the predicted relevant pages that were correct:

Precision (%) = TP / (FP + TP)    (6)

Recall - The proportion of the relevant pages that were correctly identified:

Recall (%) = TP / (FN + TP)    (7)

F-Measure - Derives from the precision and recall values:

F-Measure (%) = (2 × Recall × Precision) / (Recall + Precision)    (8)
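Eqs. (5)-(8) translate directly into code; the sketch below assumes counts such as those returned by the tallying helper above and guards against division by zero:

```python
def metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F-measure (Eqs. 5-8) as percentages."""
    accuracy = 100.0 * (tn + tp) / (tn + fn + fp + tp)
    precision = 100.0 * tp / (fp + tp) if (fp + tp) else 0.0
    recall = 100.0 * tp / (fn + tp) if (fn + tp) else 0.0
    f_measure = (2 * recall * precision) / (recall + precision) if (recall + precision) else 0.0
    return accuracy, precision, recall, f_measure
```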

The F-Measure was used because, although Precision and Recall are valid metrics in their own right, one can be optimised at the expense of the other [22]. The F-Measure only produces a high result when Precision and Recall are both high and balanced, which makes it a particularly significant measure.

A Receiver Operating Characteristic (ROC) curve analysis was also performed, as it shows the sensitivity (FN classifications) and specificity (FP classifications) of a classifier.
