IJCSI International Journal of Computer Science Issues, Vol. 4, No 1, 2009
test. The ROC curve is a comparison of two
characteristics: TPR (true positive rate) and FPR (false
positive rate). The TPR measures the proportion of relevant pages that were correctly identified as relevant.
TPR = TP / (TP + FN) (9)
The FPR measures the proportion of irrelevant test pages that were incorrectly classified as relevant.
FPR = FP / (FP + TN) (10)
In the ROC space graph, FPR and TPR values form the x
and y axes respectively. Each prediction (FPR, TPR)
represents one point in the ROC space. There is a diagonal
line that connects points with coordinates (0, 0) and (1, 1).
This line is called the “line of no-discrimination”: all points along it correspond to completely random guesses. Points above the diagonal indicate good classification results, whereas points below it indicate results worse than random guessing. The best prediction
(i.e. 100% sensitivity and 100% specificity), also known
as ‘perfect classification’, would be at point (0, 1). Points
closer to this coordinate show better classification results
than other points in the ROC space.
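Equations (9) and (10) can be computed directly from the four confusion-matrix counts. The sketch below (using hypothetical counts, not the paper's results) shows how one prediction becomes a single (FPR, TPR) point in ROC space:

```python
# Sketch of Eqs. (9) and (10): locating one prediction in ROC space.
# The counts below are illustrative only, not taken from the paper.

def roc_point(tp, fn, fp, tn):
    """Return the (FPR, TPR) coordinates of one prediction in ROC space."""
    tpr = tp / (tp + fn)   # Eq. (9): fraction of relevant pages found
    fpr = fp / (fp + tn)   # Eq. (10): fraction of irrelevant pages misclassified
    return fpr, tpr

fpr, tpr = roc_point(tp=90, fn=10, fp=20, tn=80)
# A point above the diagonal (TPR > FPR) is better than random guessing.
print(fpr, tpr, tpr > fpr)
```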
4.2 Data Corpus
In this research, each web page is referred to as a sampling
unit. Each sampling unit comprises a maximum of 100
features, which are selected after discarding much of the
page content, as explained previously. The total number of
unique features examined in the following experiments
was 5217. The total number of sampling units used was
9436. These units were separated into two distinct sets: a
training set and a test set.
The training set for the NB classifier consisted of 711
randomly selected, positive and negative examples (i.e.
relevant and irrelevant sampling units). The test collection
created consisted of data obtained from the remaining
8725 sampling units. The training set accounts for under 10% of the entire data corpus. This was a deliberate decision, intended to genuinely challenge the NB classifier; it is the smallest training set of any classification system we encountered.
4.3 Experimental Results
The first experiment tested our enhanced NB classifier against the standard Naïve Bayes algorithm, in order to determine whether the changes made to the original algorithm had improved the classifier's accuracy. For this purpose, we stripped
our system of the additional steps and executed both
standard and enhanced NB classifiers with the above
training and test data. The results showed that the enhanced NB classifier led comfortably, outperforming the standard algorithm by over 7% in both accuracy and F-Measure.
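For context, the standard Naïve Bayes algorithm referred to above can be sketched as follows. This is a generic log-space implementation with Laplace smoothing and hypothetical toy data; it is not the authors' enhanced classifier or their corpus.

```python
import math
from collections import Counter

# Minimal log-space Naive Bayes sketch with Laplace smoothing.
# The documents and feature strings here are hypothetical examples.

def train_nb(docs):
    """docs: list of (feature_list, label). Returns class priors and counts."""
    priors = Counter(lbl for _, lbl in docs)
    counts = {lbl: Counter() for lbl in priors}
    for feats, lbl in docs:
        counts[lbl].update(feats)
    return priors, counts

def classify(feats, priors, counts, vocab_size):
    def score(lbl):
        total = sum(counts[lbl].values())
        s = math.log(priors[lbl] / sum(priors.values()))
        for f in feats:
            # Laplace (+1) smoothing avoids zero probability for unseen features
            s += math.log((counts[lbl][f] + 1) / (total + vocab_size))
        return s
    return max(priors, key=score)

docs = [(["cheap", "pills"], "irrelevant"),
        (["research", "paper"], "relevant"),
        (["research", "results"], "relevant")]
priors, counts = train_nb(docs)
print(classify(["research"], priors, counts, vocab_size=6))
```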
In the second set of experiments, the sampling units
analysed by the NB classifier were also run by a DT
classifier and an NN classifier. The results were compared
to determine which classifier is better at analysing
attribute data from training web pages. The DT classifier
is a ‘C’ program, based on the C4.5 algorithm in [16],
written to evaluate data samples and find the main
pattern(s) emerging from the data. For example, the DT
classifier may conclude that all web pages containing a specific word are relevant. More complex data samples, however, may produce more complex tree configurations.
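The pattern-finding step the DT classifier performs can be illustrated with the information-gain measure underlying C4.5 (C4.5 itself ranks attributes by gain ratio, a normalised form of this quantity). The toy samples below are hypothetical, not drawn from the paper's corpus:

```python
import math

# Sketch of the attribute-selection step behind a C4.5-style decision tree:
# the feature whose split yields the highest information gain is chosen first.

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((labels.count(l) / n) * math.log2(labels.count(l) / n)
                for l in set(labels))

def info_gain(samples, feature):
    """samples: list of (feature_set, label) pairs."""
    labels = [lbl for _, lbl in samples]
    with_f = [lbl for feats, lbl in samples if feature in feats]
    without_f = [lbl for feats, lbl in samples if feature not in feats]
    n = len(samples)
    split_entropy = (len(with_f) / n) * entropy(with_f) \
                  + (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - split_entropy

# A feature that perfectly separates relevant from irrelevant pages
# (the "specific word" case above) carries one full bit of information gain.
samples = [({"word"}, "relevant"), ({"word"}, "relevant"),
           (set(), "irrelevant"), (set(), "irrelevant")]
print(info_gain(samples, "word"))
```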
The NN classifier used is also a ‘C’ program, based on the
work published in [8]-[11]. MATLAB’s NN toolbox ([20]) could also have been used; however, in past experiments MATLAB completed approximately 2 training epochs in the time the ‘C’ NN classifier achieved approximately 60,000. We therefore abandoned MATLAB in favour of the bespoke compiled NN system.
All three classifiers were initially trained with 105
sampling units and tested with a further 521 units, all
consisting of a total of 3900 unique features. The NB
classifier achieved the highest accuracy (97.89%),
precision (99.20%), recall (98.61%) and F-Measure
(98.90%) values; the DT classifier, however, achieved the fastest execution time. The NN classifier, created with 3900 inputs, 3900 hidden nodes and 1 output, came last in every metric and in execution time.
For the final test, all classifiers were trained with 711 sampling units and then tested on the remaining 8725 sampling units. The NB and DT classifiers were fast enough for practical use and delivered good discrimination. The test results are shown in Table 2 and Table 3.
Table 2: Confusion Matrix for NB Classifier
                            PREDICTED
                       IRRELEVANT    RELEVANT
ACTUAL   IRRELEVANT    TN / 876      FP / 47
         RELEVANT      FN / 372      TP / 7430
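As a sanity check, the headline metrics implied by the Table 2 counts (TP = 7430, FP = 47, FN = 372, TN = 876) can be recomputed directly from their standard definitions:

```python
# Metrics implied by the Table 2 confusion matrix for the NB classifier.
tp, fp, fn, tn = 7430, 47, 372, 876

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)          # identical to TPR in Eq. (9)
f_measure = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f={f_measure:.4f}")
```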
Table 3: Confusion Matrix for DT Classifier