Database Search Strategies for Proteomic Data Sets
Exported Mascot results were sorted (in Excel) by descending
score, and then by protein accession (A to Z, in this case, to
ensure ###REV... is listed before IPI:...). This was to ensure that,
for a reverse and a forward hit of identical score, the reverse
hit was preferentially retained. Lower-scoring identifications
were then removed, leaving only the top-scoring identification for
each MS/MS event (by removing duplicates in the Peptide Scan Title
column). The mass error in ppm was calculated for each
identification from the charge state, theoretical mass, and
delta mass: ppm error = delta mass / (theoretical mass + charge state × 1.00728)
× 1 000 000.
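The sorting, de-duplication, and ppm calculation described above were done in Excel; a minimal Python sketch of the same steps might look as follows. The `Hit` record, its field names, and the function names are assumptions made for illustration and do not correspond to the actual Mascot export columns. Python's default string ordering happens to place an accession beginning with `#` before one beginning with `I`, which mirrors the A to Z sort used in the text.

```python
from dataclasses import dataclass

PROTON_MASS = 1.00728  # Da, the value used in the formula in the text


@dataclass
class Hit:
    # Hypothetical record layout, not the real export format.
    scan_title: str    # stands in for the "Peptide Scan Title" column
    accession: str     # protein accession (reverse hits sort first, A to Z)
    score: float       # search engine score
    charge: int        # precursor charge state
    theor_mass: float  # theoretical mass, Da
    delta_mass: float  # delta mass, Da


def top_hit_per_scan(hits):
    """Sort by descending score, then accession A to Z, and keep the
    first hit seen for each scan title (the 'remove duplicates' step)."""
    ordered = sorted(hits, key=lambda h: (-h.score, h.accession))
    best = {}
    for h in ordered:
        best.setdefault(h.scan_title, h)
    return list(best.values())


def ppm_error(hit):
    """Mass error in ppm, following the formula given in the text."""
    return hit.delta_mass / (hit.theor_mass + hit.charge * PROTON_MASS) * 1_000_000
```

With two hits of identical score for the same scan, one forward and one reverse, `top_hit_per_scan` keeps the reverse hit, which is the conservative choice the text describes.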
The OMSSA Browser 2.1.1 was employed to search the DTA
files. OMSSA settings were as previously described, with the
exception that phosphorylation was not considered as a
variable modification.5 For the ECD search, the ‘elimination
of charge-reduced precursors in spectrum’ option was selected.
Re-searching ECD Phosphopeptide Data Set. A total of 6080
ECD DTAs were processed as above, that is, by removal of noise
peaks, the precursor window, and neutral loss peaks. Database
searching was as for the unmodified data set (above), but allowing STY
phosphorylation as a variable modification. Postsearch filtering
was as above.
Results
To test the effect of various search-related parameters, we
employed a test data set consisting of 3341 high quality ECD
mass spectra obtained from the LC-MS/MS analysis of mouse
whole cell lysate. Paired ion trap CID and FT-ICR ECD mass
spectra were acquired, as previously described.4,5 The mouse
IPI database was searched; a concatenated forward-reverse
version of this database was employed, unless stated otherwise.
In all cases, the false-discovery rate (FDR) as estimated by the
number of accepted reverse identifications was controlled at
less than 1%. Full details of the peptide identifications are
supplied as Supplementary Data.
Initial Search. An initial search, without preprocessing of
the CID or ECD data, was carried out using both search
engines: Mascot and OMSSA. The precursor mass tolerance was
set to 0.02 m/z (OMSSA) or 10 ppm (Mascot).
For the initial Mascot search, a forward-only version of the
mouse IPI database was employed, in combination with the
Mascot ‘decoy’ option. The ‘decoy’ option automatically carries
out a second search using a randomized database, and thereby
gives an estimate of FDR. However, adjusting the FDR to a
particular value (1%) was not possible. The search resulted in
633 ECD identifications and 1712 CID identifications. To better
control the estimated FDR, we repeated the Mascot search
without the ‘decoy’ option, using the concatenated version of the
database (as used in all subsequent searches), exported all
results into Excel, and manually filtered them according to Mascot
score. This approximately doubled the number of accepted ECD
identifications, as shown in Table 1 (ECD Search: row 1 versus
row 3). Manually filtered Mascot and OMSSA searches give
similar numbers of identifications for both ECD and CID data
sets. The identification rates reach 38% for ECD data (1254
identifications) and 69% for CID data (2297 identifications).
Clearly, there is a considerable difference, of approximately
30 percentage points, in identification success rate between CID
and ECD mass spectra.
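The manual score filtering described above can be sketched in Python as a search for the most permissive score cutoff that keeps the estimated FDR at or below 1%. This is an illustration only: it assumes the FDR is estimated as accepted reverse hits divided by accepted forward hits, and that the ID rate is accepted forward hits divided by the number of DTAs searched. Both assumptions agree with the counts reported in Table 1, but the exact procedure used in Excel is not spelled out in the text. The function names are hypothetical.

```python
def score_cutoff_for_fdr(hits, max_fdr=0.01):
    """hits: iterable of (score, is_reverse) pairs, one per MS/MS event.

    Returns (cutoff, n_forward, n_reverse) for the lowest score cutoff
    at which reverse/forward <= max_fdr among hits scoring >= cutoff.
    Returns (None, 0, 0) if no cutoff satisfies the limit.
    """
    ordered = sorted(hits, key=lambda h: -h[0])
    best = (None, 0, 0)
    forward = reverse = 0
    i = 0
    while i < len(ordered):
        score = ordered[i][0]
        # Accept every hit sharing this score so the cutoff is well defined.
        while i < len(ordered) and ordered[i][0] == score:
            if ordered[i][1]:
                reverse += 1
            else:
                forward += 1
            i += 1
        if forward > 0 and reverse / forward <= max_fdr:
            best = (score, forward, reverse)
    return best


def id_rate(n_forward, n_spectra):
    """Identification rate in percent."""
    return 100.0 * n_forward / n_spectra
```

For the manually filtered Mascot ECD search, `id_rate(1254, 3341)` rounds to the 37.5% reported in Table 1, and 12 reverse hits against 1254 forward hits is just under the 1% limit.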
Table 1. Initial Searches of ECD and CID Data Filtered According to Database Search Algorithm Score^a

| search | postsearch filter | forward | reverse | ID rate (%) |
|---|---|---|---|---|
| ECD Search (3341 DTAs) | | | | |
| Mascot; 10 ppm precursor; | | 633 | 2* | 18.9 |
| OMSSA; 0.02 Da precursor | Peptide score | 1190 | 11 | 35.6 |
| Mascot; 10 ppm precursor | Peptide score | 1254 | 12 | 37.5 |
| CID Search (3341 DTAs) | | | | |
| Mascot; 10 ppm precursor; | | 1712 | 16* | 51.2 |
| OMSSA; 0.02 Da precursor | Peptide score | 2297 | 22 | 68.8 |
| Mascot; 10 ppm precursor | Peptide score | 2283 | 22 | 68.3 |

^a DTA files are unaltered. Asterisks indicate “decoy” hits, from the Mascot “Decoy” search option.

Table 2. Searches of ECD and CID Data in Which a Wider Precursor Mass Tolerance Window Was Combined with Postsearch Precursor Mass Error Filtering^a

| search | postsearch filter | forward | reverse | ID rate (%) |
|---|---|---|---|---|
| ECD Search (3341 DTAs) | | | | |
| OMSSA; 1.1 Da precursor | Precursor ppm error and peptide score | 1447 | 14 | 43.3 |
| Mascot; 1.1 Da precursor | Precursor ppm error | 1468 | 7 | 43.9 |
| CID Search (3341 DTAs) | | | | |
| OMSSA; 1.1 Da precursor | Precursor ppm error | 2344 | 9 | 70.2 |
| Mascot; 1.1 Da precursor | Precursor ppm error | 2385 | 9 | 71.4 |

^a To achieve the estimated FDR of 1%, results were filtered according to database search algorithm scores where necessary (E-value cutoff of 8.01 × 10^-1 for OMSSA ECD search).

Postsearch Filtering by Precursor Mass Error. Database
searches employing a wide precursor mass tolerance window,
with subsequent filtering of results, have previously been shown
to improve identification rates.5,10 While the benefits of
postsearch filtering are well established, we were interested in
comparing the magnitude of its effect with that of the other levels of
data processing described here, and its effectiveness for ECD
compared to CID data. We therefore repeated the above
searches with a precursor mass tolerance of 1.1 Da and
exported all results for subsequent manual filtering of
identifications by precursor error (in ppm) and, if the selected ppm
range contained more reverse hits than compatible with a 1%
FDR, by peptide score. This resulted in an increased number
of accepted identifications for all searches (Table 2). We note
that the increase in identification efficiency for ECD data is
greater than that for CID data, for example, increases of 6.4
and 3.1 percentage points for ECD and CID searches, respectively,
using Mascot. This
characteristic may be the result of the lower peptide scores for
ECD identifications (Mascot average score of 23 versus 40, for
ECD (n = 1468) and CID (n = 2385), respectively); that is, the
true identifications are less readily distinguished from reverse
hits on the basis of peptide score alone.
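The two-stage filter described above (precursor ppm error first, peptide score only if needed) can be sketched as follows. This is a hedged illustration rather than the procedure actually carried out in Excel: the dictionary keys, the function name, and the example ppm window are assumptions, and the FDR is again taken to be reverse hits divided by forward hits. The text does not state which ppm ranges were selected.

```python
def filter_by_ppm_then_score(hits, ppm_window, max_fdr=0.01):
    """hits: list of dicts with keys 'ppm', 'score', 'is_reverse'.

    Keep hits whose precursor error lies inside ppm_window (low, high).
    If that set still exceeds max_fdr, raise a score cutoff until the
    estimated FDR (reverse / forward) is within the limit.
    """
    low, high = ppm_window
    in_window = [h for h in hits if low <= h["ppm"] <= high]

    def fdr(subset):
        forward = sum(not h["is_reverse"] for h in subset)
        reverse = sum(h["is_reverse"] for h in subset)
        return reverse / forward if forward else float("inf")

    if fdr(in_window) <= max_fdr:
        return in_window  # ppm filtering alone is sufficient

    # Otherwise try score cutoffs from most to least permissive.
    for cutoff in sorted({h["score"] for h in in_window}):
        kept = [h for h in in_window if h["score"] >= cutoff]
        if fdr(kept) <= max_fdr:
            return kept
    return []
```

A call such as `filter_by_ppm_then_score(hits, (-5.0, 5.0))` would apply a hypothetical window of ±5 ppm. Because reverse hits from a 1.1 Da search are spread across the whole tolerance window while true hits cluster near zero error, the ppm step removes most reverse hits before any score cutoff is applied.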
Previous work has shown that it is possible for the precursor
mass recorded in the DTA file to correspond to the second or
third isotopic peak (i.e., one or two 13C atoms more than the
monoisotopic peak).11 This occurrence is particularly common
for low-resolution, ion-trap-only experiments. If this occurs,
identifications can be rescued by searching with a larger
precursor tolerance window (with subsequent narrow mass
filtering around the offset precursor mass). We compared
searches with 1.1, 2.1, and 3.1 Da tolerances. In none of the
cases was there a high-scoring identification resulting from
Journal of Proteome Research • Vol. 8, No. 12, 2009 5477