Database Search Strategies for Proteomic Data Sets
Table 4. Effect of Presearch Trimming of ECD DTA Files on

ECD Search (3341 DTAs), OMSSA

| presearch processing                                                                              | forward hits (2+) | reverse hits | ID rate (%) |
|---------------------------------------------------------------------------------------------------|-------------------|--------------|-------------|
| No DTA trimming                                                                                   | 1390 (733)        | 12           | 41.6        |
| Remove noise peak                                                                                 | 1409 (745)        | 14           | 42.2        |
| Remove precursor window                                                                           | 1453 (786)        | 13           | 43.5        |
| Remove precursor window. Remove all nonfragments (MH to MH - 140 m/z).                            | 1470 (803)        | 13           | 44.0        |
| Remove noise peak and precursor window                                                            | 1469 (801)        | 13           | 44.0        |
| Remove noise peak and precursor window. Remove all nonfragments (MH to MH - 140 m/z).             | 1480 (813)        | 14           | 44.3        |

a Identifications from doubly-charged precursors shown in parentheses. As processing also affects the scores of the reverse hits, the optimum E-value cutoff varies between 2.36 × 10⁻¹ and 8.01 × 10⁻¹. For direct comparison, the more conservative value of 2.36 × 10⁻¹ is used for
0.04% of this range. The fraction of useful fragment peaks
removed is expected to be less than 0.04%, as fragment peaks
coinciding with the precursor will, in any case, not be distin-
guishable from the high intensity precursor. The increase in
identifications shown in Tables 3 and 4 indicates that the net
effect is positive. By plotting every identification score, with
and without removal of the precursor isolation window, we see
that the vast majority of identifications result in an increased
score (Figure 3). (In a small number of cases, a previously
accepted identification was replaced with an unacceptable
identification (reverse hit or unacceptable mass error), due to
the score of the unacceptable identification increasing by more
than the score of the previously accepted identification. These
cases are plotted as a score of zero.)
To control for any unanticipated effects of this trimming
process, a window of the same size (6 m/z), but shifted away
from the precursor isolation region (+25 m/z), was removed.
This had no net effect on the number of ECD identifications
(data not shown). Removing the precursor isolation window
from CID DTAs also has no net effect, positive or negative (data
not shown). This observation agrees with the lack of high
intensity intact precursor surviving in this region during the
ion-trap excitation event, leaving no intense peaks to potentially
confound the database search.
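The window removal and its shifted control can be illustrated with a minimal Python sketch. It assumes a spectrum held as a list of (m/z, intensity) pairs and a window centred on the precursor m/z; the function name `remove_window`, the centring, and the example peak values are hypothetical and are not taken from the software used in this work.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def remove_window(peaks: List[Peak], center_mz: float,
                  width: float = 6.0, shift: float = 0.0) -> List[Peak]:
    """Drop every peak inside a window of `width` m/z.

    The window is centred on `center_mz + shift` (centring is an assumption).
    shift=0 removes the precursor isolation window; shift=25 reproduces
    the control in which a same-size window is moved away from the precursor.
    """
    low = center_mz + shift - width / 2.0
    high = center_mz + shift + width / 2.0
    return [(mz, inten) for mz, inten in peaks if not (low <= mz <= high)]


# Hypothetical peak list with an intense precursor at 500.3 m/z
peaks = [(350.2, 120.0), (498.9, 900.0), (500.3, 15000.0),
         (502.8, 700.0), (525.1, 80.0), (640.4, 210.0)]
precursor_mz = 500.3

trimmed = remove_window(peaks, precursor_mz)              # precursor window removed
control = remove_window(peaks, precursor_mz, shift=25.0)  # shifted control window

print([mz for mz, _ in trimmed])
print([mz for mz, _ in control])
```

In this toy spectrum the precursor window removes the three peaks around 500.3 m/z, whereas the control window removes only the single peak near 525 m/z.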
Neutral Losses from Charge-Reduced Precursor. A salient
feature of ECD is the potential for neutral losses from the
charge-reduced precursor.6,7 These losses are particularly
evident in ECD mass spectra of doubly charged precursors. The
tryptic peptides are predominantly doubly charged: 67% of the
ECD mass spectra collected were from doubly charged precur-
sors. However, the success rate for the ECD identification of
doubly charged precursors was lower than that for triply
charged species (32% versus 54%, for the Mascot unaltered
search). Figure 4 shows the proportion of identifications by
charge-state for both CID (4b) and ECD (4c), alongside the DTA
input proportion (4a). The observed lower success rate for ECD
identifications from 2+ precursors agrees with the previously
reported data for ETD of 2+ and 3+ precursors.14 We hypoth-
esized that neutral loss peaks from the charge-reduced precur-
sor which are not anticipated by the database search engine
might be detrimental to the identification of doubly charged
peptides. Rather than remove specific neutral losses from the
2+ DTAs, we chose to retain all potential true fragment ions
within 140 m/z of the charge-reduced precursor; specifically,
all possible c, z, z-prime and y ions from tryptic peptides were
listed for retention. (The m/z values of potential true fragment
ions were calculated on the basis of the following known
parameters: m/z of the charge-reduced precursor, mass of
amino acid residues, and structures of c, z, z-prime and y ions.)
This resulted in a net increase in 2+ identification efficiency
of 3.5% and 1.5% for Mascot and OMSSA, respectively. The
cumulative effect of the three precursor trimming operations
on the number of 2+ identifications is an increase of 16% for
Mascot (from 870 to 1006) and 11% for OMSSA (733 to 813).
Figure 5 plots the Mascot scores of all 1006 2+ identifications
after removing the neutral loss peaks, alongside the scores prior
to the additional processing. For 98.5% of the previously
identified 2+ peptides, the same peptide was identified. In the
remaining 15 cases, the previously accepted identification was
relegated to second place, behind either a decoy hit or a hit
with an unacceptably large mass error (presumed false-
positive). In some cases, it was not clear why the Mascot score
of the previously accepted hit failed to increase as much as
the alternative identification. This is likely related to the overall
number of peaks present and the windows into which Mascot
divides the spectrum (D. Creasy, personal communication).
Large fragment ions from nontryptic peptides (e.g., C-terminal
peptides) could potentially be lost; however, in none of these
15 cases did we observe the loss of a real fragment ion from
the Mascot identification.
We compared the strategy described above (retaining all true
fragment ions) with the simpler approach of removing the
entire 140 m/z region below the [M + 2H]+• peak, followed by
a Mascot search and postsearch filtering as before. The
identification rate was reduced, with 7% fewer identifications
from doubly charged precursors (936 vs 1006). The removal of
any of the 33 true c/z/y fragment ions which fall in this region
will be detrimental to identification.
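The two strategies compared above can be sketched in Python as a single filter. This is only an illustration under stated assumptions: the list of potential true fragment m/z values (`keep_mz`) is taken as precomputed, the matching tolerance `tol` is hypothetical, and the function name and example peaks are invented for the sketch.

```python
from typing import Iterable, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def trim_below_reduced_precursor(peaks: List[Peak], reduced_mz: float,
                                 keep_mz: Iterable[float] = (),
                                 region: float = 140.0,
                                 tol: float = 0.02) -> List[Peak]:
    """Remove peaks between reduced_mz - region and reduced_mz.

    Peaks inside the region survive only if they lie within `tol` m/z of a
    value in `keep_mz` (the assumed, precomputed list of potential c/z/y
    fragment ions). An empty `keep_mz` removes the entire region.
    """
    keep = list(keep_mz)
    low = reduced_mz - region
    kept = []
    for mz, inten in peaks:
        in_region = low <= mz <= reduced_mz
        is_listed = any(abs(mz - k) <= tol for k in keep)
        if not in_region or is_listed:
            kept.append((mz, inten))
    return kept


# Hypothetical 2+ spectrum; charge-reduced precursor at 1000.6 m/z
peaks = [(400.2, 50.0), (870.5, 30.0), (940.3, 200.0),
         (983.6, 500.0), (1000.6, 5000.0)]

# Strategy 1: retain listed fragment ions inside the region
with_list = trim_below_reduced_precursor(peaks, 1000.6, keep_mz=[870.5])
# Strategy 2: remove the whole 140 m/z region
whole_region = trim_below_reduced_precursor(peaks, 1000.6)

print(with_list)
print(whole_region)
```

With the retention list, the peak at 870.5 m/z survives; without it, that potential fragment is discarded together with the neutral loss peaks, which mirrors the reduced identification rate reported for the simpler approach.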
Validation of Identifications. A large number of the accepted
ECD identifications have low Mascot peptide scores. On the
basis of the Mascot scores alone, these identifications would
usually be rejected; however, the FDR estimate suggests that
the majority are correct. We employ two strategies in order to
validate the ECD identifications: (1) check for agreement
between identifications from paired CID and ECD events; (2)
manually assess a small number of low-scoring ECD identifica-
tions, where no paired CID identification was made.
Of the 1643 ECD identifications, only 83 lack a paired CID
identification; that is, 1560 (95%)
of the ECD identifications are from paired events which also
led to a CID identification. This high degree of overlap is due
to the high CID success rate (>70%), which is in turn attribut-
able to the fragmentation of high intensity (>40 000 counts)
tryptic peptides. The estimated FDR of less than 1% predicts
fewer than 15.6 false-positives in this ECD data set and the
same number in the CID data set. The FDR therefore predicts
fewer than 31.2 identification conflicts (as a false-positive for
either the ECD or CID identification will result in a conflict).
In fact, we find 21 conflicts, well within the limit of 31 expected
for a 1% FDR (Supplementary Table 2). Of these conflicts, 18
out of 21 are isobaric peptides, often with similar sequences
(e.g., VAPDEHPILLTEAPLNPK and VAPEEHPVLLTEAPLNPK).
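The conflict check and the FDR-based bound can be reproduced with a short Python sketch. The counts (1643, 83, 21) and the 1% FDR come from the text; the dictionary layout keyed by a paired-event number, the function names, and the toy identifications other than the two quoted peptides are assumptions made for illustration.

```python
from typing import Dict, List


def find_conflicts(ecd_ids: Dict[int, str], cid_ids: Dict[int, str]) -> List[int]:
    """Return paired events whose ECD and CID peptide sequences disagree."""
    paired = ecd_ids.keys() & cid_ids.keys()
    return sorted(k for k in paired if ecd_ids[k] != cid_ids[k])


def expected_conflict_bound(n_paired: int, fdr: float) -> float:
    """Upper bound on conflicts: a false positive in either method conflicts."""
    return 2 * n_paired * fdr


# Toy paired identifications (event number -> peptide); event 3 has no CID hit
ecd = {1: "VAPDEHPILLTEAPLNPK", 2: "PEPTIDEK", 3: "SAMPLER"}
cid = {1: "VAPEEHPVLLTEAPLNPK", 2: "PEPTIDEK"}
conflicts = find_conflicts(ecd, cid)

# Figures reported in the text
n_paired = 1643 - 83            # ECD identifications with a paired CID hit
bound = expected_conflict_bound(n_paired, 0.01)
observed_conflicts = 21

print(conflicts)
print(n_paired, round(bound, 1), observed_conflicts <= bound)
```

The bound counts a false positive from either fragmentation method as one conflict, so it is twice the per-method expectation of 15.6.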
To examine the distribution of conflicts, the 1560 ECD
identifications with paired CID identifications were divided
Journal of Proteome Research • Vol. 8, No. 12, 2009 5479