The mstherm package is used for modeling and analyzing
thermal proteome profiling (TPP) experiments. For more details on the
motivation and methodology behind such experiments, please see Savitski et al., 2014.
Analyzing a TPP experiment in mstherm involves generating a
table of quantitative temperature series data for each replicate, setting
up metadata files describing the experiment, creating an
MSThermExperiment object, normalizing the data, modeling, visualization,
and data export. Each of these steps is described below as a sample
dataset is analyzed.
The primary input to mstherm is a set of tab-delimited
data tables containing quantitative data from the experiment, one file per
replicate. Each table consists of a header line followed by one line per
peptide spectral match. It can contain any number of columns, but
certain columns must be present (and labeled properly) to be used with
mstherm. Specifically, the following columns are required:
peptide - the sequence of the matched peptide in single-letter IUPAC
protein - the protein group the peptide is assigned to
... - one column per temperature point, with labels matching those in the meta table described below
The following columns are optional but contain information that can be used for filtering (all other columns are simply ignored):
score - the score assigned by the search engine or processing software (e.g. Percolator, PeptideProphet) to the peptide spectral match
coelute_inf - the fraction of the signal in the precursor window derived from off-target ions (e.g. the co-isolation interference, expressed as a fraction from 0.0 to 1.0)
These data files are generally produced by whatever quantification software is used, but may need manual or batch processing to apply the correct column labels.
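As a rough illustration of the table layout described above, the following base R sketch writes a toy data file and checks it for the required columns. The channel labels, peptide sequences, protein IDs, and the helper missing_columns() are all hypothetical and are not part of mstherm.

```r
# Hypothetical channel labels; real labels must match the meta table.
channels <- c("TMT10.126", "TMT10.127N", "TMT10.127C")

# Toy PSM table: required columns, channel columns, then optional filter columns.
psm <- data.frame(
  peptide     = c("LVNELTEFAK", "YLYEIAR", "AEFVEVTK"),
  protein     = c("protA", "protA", "protB"),
  TMT10.126   = c(1500, 900, 2100),
  TMT10.127N  = c(1400, 870, 1900),
  TMT10.127C  = c(1100, 700, 1500),
  score       = c(0.99, 0.95, 0.97),
  coelute_inf = c(0.05, 0.20, 0.10),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header line and no row names.
path <- file.path(tempdir(), "Control_R1.tsv")
write.table(psm, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back without letting R rewrite the column labels.
psm_in <- read.delim(path, check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: report required columns that are absent.
missing_columns <- function(tbl, channels) {
  setdiff(c("peptide", "protein", channels), names(tbl))
}

missing_columns(psm_in, channels)
```

A check of this kind can be useful after batch relabeling of quantification output, before the files are handed to the package.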
Although mstherm has some capabilities for reading
metadata from existing dataframes, typically this information will be
read from files on disk. While these files can be arranged in any way
the user wishes, it is recommended that a new directory be set up for
each project with the following layout:
├── control.tsv
├── data
│ ├── Control_R1.tsv
│ ├── Control_R2.tsv
│ ├── Control_R3.tsv
│ ├── Treated_R1.tsv
│ ├── Treated_R2.tsv
│ └── Treated_R3.tsv
└── meta
    └── tags.meta
The control.tsv file is a tab-delimited table describing
the experiment and paths to relevant files. An example of a control file
for the above experiment would be as follows:
name sample data_file meta_file
C1 Control data/Control_R1.tsv meta/tags.meta
C2 Control data/Control_R2.tsv meta/tags.meta
C3 Control data/Control_R3.tsv meta/tags.meta
T1 Treated data/Treated_R1.tsv meta/tags.meta
T2 Treated data/Treated_R2.tsv meta/tags.meta
T3 Treated data/Treated_R3.tsv meta/tags.meta
Paths in this file are relative to the location of the control file
itself. In this experiment, all samples used the same quantitative
methodology and so all share the same meta file, but any number of meta
files can be used. The meta_file is another tab-delimited
table that maps quantitative column labels in the data files to
temperature points. An example for an experiment utilizing TMT-10plex
isobaric labeling might look as follows:
channel temp
TMT10.126 28.2
TMT10.127N 31.3
TMT10.127C 35.0
TMT10.128N 38.5
TMT10.128C 42.1
TMT10.129N 45.7
TMT10.129C 49.1
TMT10.130N 52.2
TMT10.130C 56.2
TMT10.131 59.4
The channel labels can be anything the user chooses as long as they match the corresponding column labels in the data file. In all of these metadata files, the column labels must match those shown above.
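The relationship between the control file, the meta file, and the data files can be sketched in base R as follows. This is only an illustration under stated assumptions: the project is created in a temporary directory, the single data file holds one made-up row, and the helper missing_channels() is hypothetical rather than an mstherm function.

```r
# Build a miniature project in a temporary directory.
proj <- file.path(tempdir(), "demo_layout")
dir.create(file.path(proj, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(proj, "meta"), showWarnings = FALSE)

write_tsv <- function(x, path) {
  write.table(x, path, sep = "\t", quote = FALSE, row.names = FALSE)
}

# Meta table: maps channel labels to temperature points.
write_tsv(data.frame(channel = c("TMT10.126", "TMT10.127N"),
                     temp    = c(28.2, 31.3)),
          file.path(proj, "meta", "tags.meta"))

# One made-up data row whose channel columns match the meta table.
write_tsv(data.frame(peptide = "LVNELTEFAK", protein = "protA",
                     TMT10.126 = 1500, TMT10.127N = 1400,
                     check.names = FALSE),
          file.path(proj, "data", "Control_R1.tsv"))

# Control table: paths are relative to the control file itself.
control_path <- file.path(proj, "control.tsv")
write_tsv(data.frame(name = "C1", sample = "Control",
                     data_file = "data/Control_R1.tsv",
                     meta_file = "meta/tags.meta"),
          control_path)

ctl      <- read.delim(control_path, stringsAsFactors = FALSE)
base_dir <- dirname(control_path)

# Hypothetical helper: channels listed in the meta file but absent from the data file.
missing_channels <- function(ctl_row, base_dir) {
  meta <- read.delim(file.path(base_dir, ctl_row$meta_file),
                     stringsAsFactors = FALSE)
  dat  <- read.delim(file.path(base_dir, ctl_row$data_file),
                     check.names = FALSE, stringsAsFactors = FALSE)
  setdiff(meta$channel, names(dat))
}

missing_channels(ctl[1, ], base_dir)
```

Resolving paths against dirname(control_path) mirrors the rule that paths in the control file are relative to the control file's own location.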
If available, a tab-delimited table containing descriptions for each
protein group can be imported and used for plot labels during
visualization. This table should have two columns labeled
name and annotation.
With the data files in hand as described, an mstherm
session is started as follows:
library(mstherm)
control <- system.file("extdata", "demo_project/control.tsv", package="mstherm")
annots <- system.file("extdata", "demo_project/annots.tsv", package="mstherm")
expt <- MSThermExperiment(control, annots)
This will read the control file and load all necessary quantitative data and metadata into the MSThermExperiment object. This is the step in which the table of annotations can be provided if available (this information is only used to provide descriptive labels on protein melting plots). Here we are using files included in the demo of this package.
Typically, after importing the raw quantitative data, a normalization
step is performed to correct for differences in sample handling and
other sources of channel-to-channel variance. There are two methods
available in mstherm to perform this step, normalize_to_std() and
normalize_to_profile(). normalize_to_std()
requires that a protein spike-in be added equally to all samples
directly after gradient precipitation, and the method calculates and
applies scaling factors to each channel such that the values for the
spike-in are roughly equal across channels. The normalization is applied
independently to each replicate. In the demo data, BSA was spiked in to
each sample, and we use it here to apply the normalization method (using
the protein ID which matches that used in the data tables):
This performs the normalization and produces a summary plot for QC purposes (which can be captured to PDF, etc., by changing the output device before the above step). In the plots above, green points represent the original channel intensity sums (relative to the first temperature), red points show the relative quantification of the spike-in standard before normalization (with IQR indicated by red bars), blue points show the normalized relative channel sums, and the blue curve is the final normalization curve used to generate scaling factors.
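The idea behind spike-in normalization can be illustrated with a deliberately simplified base R sketch. It derives one scaling factor per channel directly from summed spike-in intensities, whereas the package fits a normalization curve before generating scaling factors; the matrix values and the names used here are made up for illustration.

```r
# Made-up spike-in intensities: rows are spike-in PSMs, columns are channels.
spike <- matrix(c(100,  80, 120,
                  200, 160, 240),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("ch1", "ch2", "ch3")))

# Spike-in level of each channel relative to the first channel.
channel_level <- colSums(spike) / sum(spike[, 1])

# Scaling factors that bring every channel back to the first channel's level.
scale_factors <- 1 / channel_level

# Apply one factor per column; the same factors would be applied
# to all other proteins measured in the same replicate.
apply_factors <- function(x, f) sweep(x, 2, f, `*`)
spike_norm <- apply_factors(spike, scale_factors)

colSums(spike_norm)
```

Because the spike-in is added in equal amounts to all samples, any remaining difference between channels is attributed to handling rather than to melting, which is why the factors can be reused for every other protein.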
The other available normalization method,
normalize_to_profile(), takes a vector of values equal in
length to the number of quantification channels and calculates and
applies scaling factors to each channel such that the ratios between
sums of quantification values for each channel match the ratios between
values given in the vector. Such values might correspond to, for
example, measured total protein concentration in each sample after
gradient precipitation.
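A minimal base R sketch of profile-based scaling follows, assuming the first channel is used as the reference point (an assumption made here for illustration, not a statement about the package internals). The intensities, the profile vector, and the helper profile_factors() are hypothetical.

```r
# Made-up quantification values: rows are PSMs, columns are channels.
quant <- matrix(c(10,  8, 4,
                  20, 18, 6),
                nrow = 2, byrow = TRUE)

# Hypothetical measured total protein per channel (arbitrary units).
profile <- c(1.0, 0.7, 0.2)

# Hypothetical helper: factors that make the ratios between channel sums
# match the ratios between the profile values, anchored on channel 1.
profile_factors <- function(quant, profile) {
  sums   <- colSums(quant)
  target <- profile / profile[1] * sums[1]
  target / sums
}

factors    <- profile_factors(quant, profile)
quant_norm <- sweep(quant, 2, factors, `*`)

# Ratios of normalized channel sums now follow the profile.
colSums(quant_norm) / colSums(quant_norm)[1]
```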
The package attempts to fit the temperature series data to a logistic curve model for two-state unfolding according to the following formula:
$$Pn = \frac{1 - \textit{p}}{1 + e^{-k(\frac{1}{T}-\frac{1}{m})}} + \textit{p}$$
where T is the temperature and m, k, and p are estimated parameters (m corresponds to the melting temperature, p the lower plateau of the curve, and k contributes to the slope of the curve).
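The formula can be written directly as an R function, which makes the roles of the parameters easier to see. The function name melt_curve() and the parameter values below are illustrative assumptions; the temperatures are those of the example meta table.

```r
# Two-state melting model: T is the temperature, m the melting temperature,
# k a slope-related parameter, and p the lower plateau.
melt_curve <- function(T, m, k, p) {
  (1 - p) / (1 + exp(-k * (1 / T - 1 / m))) + p
}

temps <- c(28.2, 31.3, 35.0, 38.5, 42.1, 45.7, 49.1, 52.2, 56.2, 59.4)

# Illustrative parameter values only.
pred <- melt_curve(temps, m = 45, k = 500, p = 0.1)
round(pred, 3)

# At T = m the curve sits halfway between the upper level (1) and the plateau p.
melt_curve(45, m = 45, k = 500, p = 0.1)
```

With a positive k the curve decreases as temperature increases, and the value at T = m is (1 + p) / 2, which is why m is interpreted as the melting temperature.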
Modeling of the data is performed with the following command:
There are many, many options available to this command which control
various aspects of modeling and filtering (see the documentation for
full details). As modeling can be slow (especially if you have data for
several thousand proteins and bootstrapping is turned on), the package
can utilize parallel processing to speed things up. This is set by the
np parameter, which by default will use all available
threads on the machine.
A second round of normalization can be performed to correct for differences in sample handling between replicates in experiments where most proteins are not expected to be changing and the global distribution of protein Tms is expected to be constant. Doing so requires that the dataset first be modeled as above to estimate Tms for each protein. A single replicate is then chosen as the baseline, and correction factors are calculated for all other replicates based on linear regression between a set of high-quality melting temperature estimates. These linear corrections are applied to the initial temperature vector for each replicate, and a new MSThermExperiment object is returned which can be re-modeled as above. The command to perform this inter-replicate normalization is:
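The block below is not the package command; it is a base R sketch of the regression idea described in this paragraph, using made-up Tm estimates. The direction of the regression (baseline regressed on the other replicate) and the helper tm_correction() are assumptions made for illustration.

```r
# Made-up high-quality Tm estimates for the same proteins in two replicates.
tm_base  <- c(42, 48, 51, 55, 60)      # baseline replicate
tm_other <- (tm_base - 0.5) / 1.02     # replicate to be corrected

# Hypothetical helper: fit baseline ~ other and return a function that
# maps temperatures of the other replicate onto the baseline scale.
tm_correction <- function(tm_base, tm_other) {
  fit <- lm(tm_base ~ tm_other)
  function(temps) {
    unname(predict(fit, newdata = data.frame(tm_other = temps)))
  }
}

correct <- tm_correction(tm_base, tm_other)

# Apply the linear correction to the replicate's initial temperature vector.
temps_other     <- c(28.2, 35.0, 42.1, 49.1, 56.2)
temps_corrected <- correct(temps_other)
temps_corrected
```

Here the synthetic data follow an exact linear relationship, so the corrected temperatures are simply 1.02 times the original values plus 0.5; real Tm estimates would scatter around the fitted line.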
Melting plots can be produced for each protein in the result set.
This can be done either for an individual protein or for the whole set
at once. For example, the following command will generate a plot for
protein P38707 of the test data:
Again, there are a number of options that can be passed to the
plot() command to control what is shown on the plots (by
default, most options are turned on). The plot() command
can also be passed a list of proteins to plot, or no arguments at all
(in which case all proteins are plotted sequentially). For example, the
following will produce a PDF file with each protein plot on a separate
page:
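The one-plot-per-page pattern relies on the standard R graphics device mechanism, which the following base R sketch illustrates. It draws synthetic melting curves for hypothetical proteins instead of calling the package's plot() method; the parameter table and output file name are made up.

```r
# Synthetic curve parameters for two hypothetical proteins.
params <- data.frame(protein = c("protA", "protB"),
                     m = c(45, 52), k = c(500, 700), p = c(0.10, 0.05),
                     stringsAsFactors = FALSE)

temps <- c(28.2, 31.3, 35.0, 38.5, 42.1, 45.7, 49.1, 52.2, 56.2, 59.4)

out <- file.path(tempdir(), "melting_curves.pdf")

# Open the PDF device; each high-level plot() call starts a new page.
pdf(out, width = 7, height = 5)
for (i in seq_len(nrow(params))) {
  y <- (1 - params$p[i]) /
    (1 + exp(-params$k[i] * (1 / temps - 1 / params$m[i]))) + params$p[i]
  plot(temps, y, type = "b", ylim = c(0, 1),
       xlab = "Temperature", ylab = "Relative soluble fraction",
       main = params$protein[i])
  abline(v = params$m[i], lty = 2)  # mark the melting temperature
}
invisible(dev.off())                # close the device to finish the file

file.exists(out)
```

Closing the device with dev.off() is what finalizes the file; forgetting it is a common reason for empty or unreadable PDFs.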
The as.data.frame() method produces a summary data frame
containing melting temperatures, model parameter estimates, and other
descriptive values for each protein that can be used for further
analysis in R or written to file using the usual methods
(e.g. write.table()). The output from this command contains
the following columns (row labels are protein group IDs):