Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages



IJCSI International Journal of Computer Science Issues, Vol. 4, No 1, 2009



test. The ROC curve is a comparison of two
characteristics: the TPR (true positive rate) and the FPR
(false positive rate). The TPR measures the proportion of
relevant pages that were correctly identified.

TPR = TP / (TP + FN)          (9)

The FPR measures the proportion of irrelevant test pages
that were incorrectly classified as relevant.

FPR = FP / (FP + TN)         (10)

In the ROC space graph, the FPR and TPR values form the x
and y axes respectively. Each prediction (FPR, TPR)
represents one point in the ROC space. A diagonal line
connects the points with coordinates (0, 0) and (1, 1).
This is called the ‘line of no-discrimination’, and all the
points along it are considered to be completely random
guesses. Points above the diagonal indicate good
classification results, whereas points below it indicate
results worse than random guessing. The best prediction
(i.e. 100% sensitivity and 100% specificity), also known
as ‘perfect classification’, would be at point (0, 1). Points
closer to this coordinate show better classification results
than other points in the ROC space.
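As a minimal sketch, Eqs. (9) and (10) map a confusion matrix to a single point in ROC space; the counts below are illustrative, not taken from the paper's experiments:

```python
def roc_point(tp, fn, fp, tn):
    """Return the (FPR, TPR) coordinates of a classifier in ROC space."""
    tpr = tp / (tp + fn)  # Eq. (9): share of relevant pages found
    fpr = fp / (fp + tn)  # Eq. (10): share of irrelevant pages misclassified
    return fpr, tpr

# Hypothetical counts: 90 of 100 relevant pages found,
# 5 false alarms among 100 irrelevant pages.
fpr, tpr = roc_point(tp=90, fn=10, fp=5, tn=95)
print(fpr, tpr)  # a point well above the no-discrimination diagonal
```

A classifier's (FPR, TPR) pair can then be compared against the diagonal: any point with TPR greater than FPR sits above the line of no-discrimination.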

4.2 Data Corpus

In this research, each web page is referred to as a sampling
unit. Each sampling unit comprises a maximum of 100
features, which are selected after discarding much of the
page content, as explained previously. The total number of
unique features examined in the following experiments
was 5217. The total number of sampling units used was
9436. These units were separated into two distinct sets: a
training set and a test set.

The training set for the NB classifier consisted of 711
randomly selected, positive and negative examples (i.e.
relevant and irrelevant sampling units). The test collection
created consisted of data obtained from the remaining
8725 sampling units. The training set makes up under
10% of the entire data corpus. This was a deliberate
choice, intended to genuinely challenge the NB classifier;
of the many classification systems encountered, this is the
smallest training set used.
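The partition described above can be sketched as a simple random split; the integer IDs here merely stand in for web pages, and the seed is an arbitrary choice for reproducibility:

```python
import random

# Sketch of the corpus partition: 9436 sampling units split into a
# 711-unit randomly selected training set and the remaining 8725
# units reserved for testing.
random.seed(42)
units = list(range(9436))
train = set(random.sample(units, 711))
test = [u for u in units if u not in train]

print(len(train), len(test))  # 711 8725
```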

4.3 Experimental Results

The first experiment tested our enhanced NB classifier
against the standard Naïve Bayes algorithm, in order to
determine whether or not the changes made to the original
algorithm had improved the accuracy of the classifier. For
this purpose, we stripped our system of the additional
steps and executed both the standard and the enhanced NB
classifier with the above training and test data. The results
showed that the enhanced NB classifier led comfortably,
by over 7% in both accuracy and F-Measure value.

In the second set of experiments, the sampling units
analysed by the NB classifier were also run through a DT
classifier and an NN classifier, and the results were
compared to determine which classifier is better at
analysing attribute data from training web pages. The DT
classifier is a ‘C’ program, based on the C4.5 algorithm in
[16], written to evaluate data samples and find the main
pattern(s) emerging from the data. For example, the DT
classifier may conclude that all web pages containing a
specific word are relevant. More complex data samples,
however, may yield more complex decision
configurations.
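The kind of single-word pattern mentioned above is what a C4.5-style tree discovers at its root: it picks the feature whose presence best separates relevant from irrelevant pages, scored by information gain. The toy dataset and words below are hypothetical, and this sketch is not the authors' C4.5-based 'C' program:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    e = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        e -= p * math.log2(p)
    return e

def info_gain(pages, labels, word):
    """Entropy reduction from splitting pages on presence of `word`."""
    with_w = [l for p, l in zip(pages, labels) if word in p]
    without = [l for p, l in zip(pages, labels) if word not in p]
    remainder = sum(len(s) / len(labels) * entropy(s)
                    for s in (with_w, without) if s)
    return entropy(labels) - remainder

# Toy sample: two relevant and two irrelevant pages as word sets.
pages = [{"course", "training"}, {"training", "fees"},
         {"weather", "news"}, {"sport", "scores"}]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

vocab = {w for p in pages for w in p}
best = max(vocab, key=lambda w: info_gain(pages, labels, w))
print(best)  # "training" perfectly separates this toy sample
```

Here the word "training" yields an information gain of 1 bit (a perfect split), so a one-level tree on it classifies every toy page correctly; C4.5 then recurses on any impure branches.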

The NN classifier used is also a ‘C’ program, based on the
work published in [8]-[11]. MATLAB’s NN toolbox [20]
could also have been used; however, in past experiments
MATLAB managed approximately 2 training epochs in
the time frame in which the ‘C’ NN classifier achieved
approximately 60,000 epochs. We therefore abandoned
MATLAB in favour of the bespoke compiled NN system.

All three classifiers were initially trained with 105
sampling units and tested with a further 521 units,
together comprising a total of 3900 unique features. The
NB classifier achieved the highest accuracy (97.89%),
precision (99.20%), recall (98.61%) and F-Measure
(98.90%) values; however, the DT classifier achieved the
fastest execution time. The NN classifier, created with
3900 inputs, 3900 midnodes and 1 output, came last in all
metrics and in execution time.
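The NN topology just described (inputs, one hidden layer of the same width, a single output) can be sketched as a forward pass. This toy version scales the paper's 3900-3900-1 network down to 4-4-1 with random untrained weights; a real run would train them by backpropagation:

```python
import math
import random

# Scaled-down sketch of the NN classifier's topology: 4 inputs,
# 4 hidden "midnodes", 1 output. Weights are random placeholders.
random.seed(0)
N_IN, N_HID = 4, 4
w_hid = [[random.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_HID)]
w_out = [random.uniform(-1, 1) for _ in range(N_HID)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    """Forward pass: binary feature vector -> relevance score in (0, 1)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

score = forward([1.0, 0.0, 1.0, 0.0])  # one page's (toy) feature vector
print(round(score, 3))
```

At the paper's full scale, the 3900×3900 hidden weight matrix alone holds over 15 million parameters, which helps explain why this classifier was the slowest of the three.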

For the final test, all classifiers were trained with 711
sampling units and then tested on the remaining 8725
sampling units. The NB and DT classifiers were fast
enough for practical use and delivered good
discrimination. The test results are shown in Table 2 and
Table 3.

Table 2: Confusion Matrix for NB Classifier

                           PREDICTED
                     IRRELEVANT    RELEVANT
ACTUAL  IRRELEVANT     TN / 876     FP / 47
        RELEVANT       FN / 372    TP / 7430
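As a consistency check, the four counts in Table 2 can be turned back into the standard evaluation metrics; the counts are the paper's, while the percentages below are derived here rather than quoted:

```python
# NB confusion matrix from Table 2: 8725 test pages in total.
tn, fp, fn, tp = 876, 47, 372, 7430

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # correct predictions over all pages
precision = tp / (tp + fp)                    # relevant among predicted relevant
recall    = tp / (tp + fn)                    # Eq. (9), the TPR
f_measure = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2%}  precision={precision:.2%}  "
      f"recall={recall:.2%}  F={f_measure:.2%}")
```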

Table 3: Confusion Matrix for DT Classifier



