--- title: "Thermal proteome profiling with MSTherm" author: "Jeremy Volkening" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_caption: yes vignette: > %\VignetteIndexEntry{Thermal proteome profiling with MSTherm} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # MSTherm The `mstherm` package is used for modeling and analysing thermal proteome profiling (TPP) experiments. For more details on the motivation and methodology behind such experiments, please see [Savitski et al, 2014](https://doi.org/10.1126/science.1255784). Analyzing a TPP experiment in `mstherm` involves generating table of quantative temperature series data for each replicate, setting up metadata files describing the experiment, creating an MSThermExperiment object, normalizing the data, modeling, visualization, and data export. Each of these steps will be described below as a sample dataset is analyzed. ## Input files ### Spectral data The primary input to `mstherm` is a set of tab-delimited data tables containing quantative data from the experiment, one file per replicate. The table consists of a header line followed by one line per peptide spectral match. It can contain any number of columns, but certain columns must be present (and labeled properly) to be used with `mstherm`. Specifically, the folowing columns are required to be present: * `peptide` - the sequence of the matched peptide in single-letter IUPAC * `protein` - the protein group the peptide is assigned to * `...` - one column per temperature point, with labels matching those in the `meta` table described below The following columns are optional but contain information that can be used for filtering (all other columns are simply ignored): * `score` - the score assigned by the search engine or processing software (e.g. Percolator, PeptideProphet) to the peptide spectral match * `coelute_inf` - the fraction of the signal in the precursor window derived from off-target ions (e.g. the percent co-isolation interference, from 0.0-1.0) These data files are generally produced by whatever quantification software is used, but may need manual or batch processing to apply the correct column labels. ### Metadata Although `mstherm` has some capabilities for reading metadata from existing dataframes, typically this information will be read from files on disk. While these files can be arranged in any way the user wishes, it is recommended that a new directory be set up for each project with the following layout: ``` ├── control.tsv ├── data │   ├── Control_R1.tsv │   ├── Control_R2.tsv │   ├── Control_R3.tsv │   ├── Treated_R1.tsv │   ├── Treated_R2.tsv │   └── Treated_R3.tsv └── meta    └── tags.meta ``` The `control.tsv` file is a tab-delimited table describing the experiment and paths to relevant files. An example of a control file for the above experiment would be as follows: ``` name sample data_file meta_file C1 Control data/Control_R1.tsv meta/tags.meta C2 Control data/Control_R2.tsv meta/tags.meta C3 Control data/Control_R3.tsv meta/tags.meta T1 Treated data/Treated_R1.tsv meta/tags.meta T2 Treated data/Treated_R2.tsv meta/tags.meta T3 Treated data/Treated_R3.tsv meta/tags.meta ``` Paths in this file are relative to the location of the control file itself. In this experiment, all samples used the same quantitative methodology and so all share the same meta file, but any number of meta files can be used. The `meta_file` is another tab-delimited table that maps quantitative column labels in the data files to temperature points. An example for an experiment utilizing TMT-10plex isobaric labeling might look as follows: ``` channel temp TMT10.126 28.2 TMT10.127N 31.3 TMT10.127C 35.0 TMT10.128N 38.5 TMT10.128C 42.1 TMT10.129N 45.7 TMT10.129C 49.1 TMT10.130N 52.2 TMT10.130C 56.2 TMT10.131 59.4 ``` The channel labels can be anything the user chooses as long as they match with the appropriate columns in the data file. In all of these metadata files, the column labels must match those shown above. ### Annotations If available, a tab-delimited table containing descriptions for each protein group can be imported and used for plot labels during visualization. This table should have two columns labeled `name` and `annotation`. ## Initializing an analysis With the data files in hand as described, an `mstherm` session is started as follows: ```{r} library(mstherm) control <- system.file("extdata", "demo_project/control.tsv", package="mstherm") annots <- system.file("extdata", "demo_project/annots.tsv", package="mstherm") expt <- MSThermExperiment(control, annots) ``` This will read the control file and load all necessary quantitative data and metadata into the MSThermExperiment object. This is the step in which the table of annotations can be provided if available (this information is only used to provide descriptive labels on protein melting plots). Here we are using files included in the demo of this package. ## Normalization Typically, after importing the raw quantitiative data a normalization step is performed to correct for differences in sampling handling and other sources of variance that can introduce variance from channel to channel. There are two methods available in `mstherm` to perform this step, `normalize_to_std()` and `normalize_to_profile()`. `normalize_to_std()` requires that a protein spike-in be added equally to all samples directly after gradient precipitation, and the method calculates and applies scaling factors to each channel such that the values for the spike-in are roughly equal across channels. The normalization is applied independently to each replicate. In the demo data, BSA was spiked in to each sample, and we use it here to apply the normalization method (using the protein ID which matches that used in the data tables): ```{r, fig.cap="Summary of normalization"} expt <- normalize_to_std(expt, "cRAP_ALBU_BOVIN") ``` This performs the normalization as well as producing a summary plot for QC purposes (which can be captured to PDF, etc, by changing the output device before the above step). In the plots above, green points represent the original channel intensity sums (relative to the first temperature), the red points show relative quantification of spike-in standard pre-normalization (with IQR indicated by red bars), the blue points show normalized relative channel sums, and the blue curve is final normalization curve used to generate scaling factors. The other available normalization method, `normalize_to_profile()`, takes a vector of values equal in length to the number of quantification channels and calculates and applies scaling factors to each channel such that the ratios between sums of quantification values for each channel match the ratios between values given in the vector. Such values might correspond to, for example, measured total protein concentration in each sample after gradient precipitation. ## Modeling The temperature series data is attempted to be fit to a logistic curve model for two-state unfolding according the following formula: $$Pn = \frac{1 - \textit{p}}{1 + e^{-k(\frac{1}{T}-\frac{1}{m})}} + \textit{p}$$ where $T$ is the temperature and $m$, $k$, and $p$ are estimated parameters ($m$ corresponds to the melting temperature, $p$ the lower plateau of the curve, and $k$ contributes to the slope of the curve). Modeling of the data is performed with the following command: ```{r, message=FALSE, results="hide"} res <- model_experiment(expt, bootstrap=T, smooth=T, min_rep_psm=3, np=1) ``` There are many, many options available to this command which control various aspects of modeling and filtering (see the documentation for full details). As modeling can be slow (especially if you have data for several thousand proteins and bootstrapping is turned on), the package can utilize parallel processing to speed things up. This is set by the `np` parameter, which by default will use all available threads on the machine. ## Inter-replicate normalization A second round of normalization can be performed to correct for differences in sample handling between replicates in experiments where most proteins are not expected to be changing and the global distribution of protein Tms is expected to be constant. Doing so requires that the dataset first be modeled as above to estimate Tms for each protein. A single replicate is then chosen as the baseline and correction factors are calculated for all other replicates based on linear regression between a set of high-quality melting temperature estimates. These linear corrections are applied to the intial temperature vector for each replicate and a new MSThermExperiment object is returned which can be re-modeled as above. The command to perform this inter-replicate normalization is: ```{r, message=FALSE, results="hide"} expt <- normalize_to_tm( expt, res ) res <- model_experiment(expt, bootstrap=T, smooth=T, min_rep_psm=3, np=1) ``` ## Visualization Melting plots can be produced for each protein in the result set. This can be done either for an individual protein or for the whole set at once. For example, the following command will generate a plot for protein `P38707` of the test data: ```{r} plot(res$P38707) ``` Again, there are a number of options that can be passed to the `plot()` command to control what is shown on the plots (by default, most options are turned on). The `plot()` command can also be passed a list of proteins to plot, or no arguments at all (in which case all proteins are plotted sequentially). For example, the following will produce a PDF file with each protein plot on a separate page: ```{r, eval=FALSE} pdf("models.pdf", 5, 5, pointsize=10, useDingbats=F) plot(res) dev.off() ``` ## Data Export The `as.data.frame()` method produces a summary dataframe containing melting temperatures, model parameter estimates, and other descriptive values for each protein that can be used for further analysis in R or written to file using the usual methods (e.g. `write.table()`). The output from this command contains the following columns (row labels are protein group IDs): * annotation - taken from annotation table, if provided, or else empty * for each replicate in the experiment: * tm - melting temperature (the $m$ parameter in the model) * psm - number of PSMs used for protein-level quantification * inf - protein-level estimate of co-isolation interference * slope - slope of the model at the inflection point * k - the $k$ parameter estimate of the model * plat - the $p$ or "plateau" parameter estimate of the model * r2 - the coefficient of determination of the model fit * rmsd - the root-mean-square deviation of the model fit