Introduction

With the ability to identify and precisely quantify thousands of proteins from complex samples, liquid chromatography (LC)-tandem mass spectrometry (MS/MS) has been the most widely used tool for proteomic studies over the past decades [...] UniProt. The mouse data were searched against the SwissProt Mus musculus database (access date 2019-02, 17,006 entries). A Q-value cutoff of 1% was applied at the precursor and protein levels. Other parameters were left at their default values.

Training and validation of the deep neural networks

HCD MS/MS spectra of peptide precursors were collected from the HeLa1 (17 runs), HeLa2 (12 runs), Mouse1 (23 runs), and Mouse2 (15 runs) DDA data (Supplementary Table [...] Anaconda distribution version 4.2.0) using Keras (version 2.2.4) with the TensorFlow (version 1.11.0) backend. Data preprocessing and visualization were conducted with R (version 3.5.1). Running time for model training is described in Supplementary Note [...] Protein Digestion Simulator (version 2.2.6794). For in silico libraries without detectability filtering, Trypsin and Trypsin/P were each set as the digestion enzyme with no missed cleavages, and the results of the two digestions were combined. For libraries with detectability filtering, Trypsin/P was set as the digestion enzyme with up to two missed cleavages. Only peptides with a length of 7 to 50 amino acids and a mass ≤ 6000 Da were kept.
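The digestion and filtering rules above can be illustrated with a minimal sketch. It is not the Protein Digestion Simulator implementation; it assumes the conventional rules (Trypsin cleaves C-terminal to K or R unless the next residue is P, Trypsin/P cleaves after every K or R) and computes monoisotopic masses of unmodified peptides. The function names and the example sequence are hypothetical, and the detectability filtering step itself is not shown.

```python
# Monoisotopic residue masses (Da) of the 20 standard, unmodified amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per peptide (free N- and C-termini)


def cleavage_sites(protein, enzyme):
    """Return the positions at which the protein is cut (index of the residue after the cut)."""
    sites = []
    for i, aa in enumerate(protein[:-1]):
        if aa in "KR":
            # Assumed rule: Trypsin does not cut before proline, Trypsin/P does.
            if enzyme == "Trypsin/P" or protein[i + 1] != "P":
                sites.append(i + 1)
    return sites


def digest(protein, enzyme="Trypsin/P", max_missed=2):
    """Return the set of peptides with at most max_missed missed cleavages."""
    bounds = [0] + cleavage_sites(protein, enzyme) + [len(protein)]
    peptides = set()
    for i in range(len(bounds) - 1):
        last = min(i + 2 + max_missed, len(bounds))
        for j in range(i + 1, last):
            peptides.add(protein[bounds[i]:bounds[j]])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def keep(peptide, min_len=7, max_len=50, max_mass=6000.0):
    """Apply the length and mass limits; peptides with non-standard residues are dropped."""
    return (min_len <= len(peptide) <= max_len
            and all(aa in RESIDUE_MASS for aa in peptide)
            and peptide_mass(peptide) <= max_mass)


protein = "MKPEPTIDERAAAK"  # made-up sequence for illustration only

# Without detectability filtering: both enzymes, no missed cleavages, results combined.
no_filter_library = {
    p
    for enzyme in ("Trypsin", "Trypsin/P")
    for p in digest(protein, enzyme, max_missed=0)
    if keep(p)
}

# With detectability filtering: candidate peptides from Trypsin/P, up to two missed cleavages.
filtered_candidates = {p for p in digest(protein, "Trypsin/P", max_missed=2) if keep(p)}
```

In this sketch the two enzymes differ only at K/R-P bonds, so combining their zero-missed-cleavage digests adds the peptides that arise when such a bond is cut.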

DIA data analysis

Raw DIA data were processed and analyzed with Spectronaut. The retention time prediction type was set to dynamic iRT. Data extraction was determined by Spectronaut based on the extensive mass calibration. Decoy generation was set to mutated. Interference correction at the MS2 level was enabled. The peptide and protein level Q-value cutoff was set to 1%. For mixed proteome samples, the SwissProt H. sapiens isoform database (access date 2018-06, 42,356 entries), the UniProt Proteome C. elegans isoform database (access date 2019-03, 28,302 entries), the SwissProt S. cerevisiae (strain ATCC 204508 / S288c) database (access date 2019-03, 6,721 entries) and the SwissProt E. coli (strain K12) database (access date 2019-03, 4,480 entries) were used as protein sequence databases. For other datasets, the protein database was the same as that used in the DDA search.

For large spectral libraries, machine learning was performed across experiments, and protein groups with a single hit (i.e. only one stripped peptide sequence) in each run were excluded. An entrapment strategy (ref. 37) was used to compare false positive identification rates under a given Q-value cutoff. An entrapment library was built using proteins from other organisms and was roughly equivalent in size to the organism-specific library (see Supplementary Table 2 for details). The organism-specific library and the entrapment library were merged and used as the target library. Identification results were filtered at 1% Q-value by the target-decoy approach implemented in Spectronaut. Decoys were generated from the whole target library, including the entrapment entries. Because the entrapment entries were introduced into the target database, entrapment hits among the filtered target hits were considered false positives. We therefore used the entrapment percentage (the number of entrapment hits as a percentage of the target hits) as a relative measure for comparing false positive rates. It should be noted that the true error rate is higher than the entrapment percentage.
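The entrapment percentage described above can be written as a short calculation. This is a minimal sketch rather than the analysis code used in the study: the function name, the input layout (identifier and Q-value pairs with decoys already removed) and the example values are hypothetical. It assumes that a hit passes when its Q-value is at or below the cutoff and that the denominator counts all filtered target hits, entrapment hits included.

```python
def entrapment_percentage(hits, entrapment_ids, q_cutoff=0.01):
    """Percentage of filtered target hits that belong to the entrapment library.

    hits: iterable of (identifier, q_value) pairs from the merged target library.
    entrapment_ids: set of identifiers that come from the entrapment library.
    """
    passed = [ident for ident, q in hits if q <= q_cutoff]
    if not passed:
        return 0.0  # no hit passed the cutoff, so there is nothing to count
    n_entrapment = sum(1 for ident in passed if ident in entrapment_ids)
    return 100.0 * n_entrapment / len(passed)


# Made-up example: four hits pass the 1% cutoff, one of them is an entrapment entry.
example_hits = [
    ("P1", 0.001), ("P2", 0.005), ("P4", 0.009),
    ("E1", 0.008), ("P3", 0.02), ("E2", 0.5),
]
example_entrapment = {"E1", "E2"}
percentage = entrapment_percentage(example_hits, example_entrapment)
```

Only false matches that happen to land on entrapment entries are counted here, whereas false matches to organism-specific entries remain undetected, which is consistent with the statement that the true error rate exceeds the entrapment percentage.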

Peptide and protein reports were exported as CSV files, and subsequent statistical analysis and visualization were performed with R scripts.
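As an illustration of post-processing an exported peptide report, the single-hit exclusion described above could be sketched as follows. The study used R scripts for this stage; the Python below is only an illustrative equivalent. The column names (run, protein_group, stripped_sequence) and the example rows are hypothetical and do not reflect the actual Spectronaut export schema, and the rule is interpreted as being applied per run.

```python
import csv
import io
from collections import defaultdict


def drop_single_hit_groups(rows):
    """Remove protein groups supported by only one stripped peptide sequence in a run."""
    peptides = defaultdict(set)
    for row in rows:
        key = (row["run"], row["protein_group"])
        peptides[key].add(row["stripped_sequence"])
    return [
        row for row in rows
        if len(peptides[(row["run"], row["protein_group"])]) >= 2
    ]


# Made-up report with hypothetical column names.
example_csv = """run,protein_group,stripped_sequence
run1,P1,PEPTIDEK
run1,P1,ANOTHERK
run1,P2,SINGLEK
run2,P1,PEPTIDEK
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
kept = drop_single_hit_groups(rows)
```

In this example only the two run1 rows of protein group P1 are retained, because P2 in run1 and P1 in run2 are each supported by a single stripped sequence.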

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this Article.