---
title: "CorMID"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{CorMID}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Please load the `CorMID` package before running the examples within this document.

```{r setup}
library(CorMID)
```

# The CorMID function to correct superimposed mass isotopologue distributions of ^13^C labelled molecules in GC-APCI-MS flux experiments

The package provides the main function `CorMID` which will estimate the fragment distribution, *r*, and the mass isotopologue distribution, *M*, for a numeric vector of measured ion intensities originating from a compound analyzed by GC-APCI-MS (see below).

## Definitions

### Gas chromatography (GC)

Gas chromatography is an analytical method to separate volatile compounds.

### Atmospheric pressure chemical ionization (APCI)

Atmospheric pressure chemical ionization (APCI) is an ionization method used in mass spectrometry which utilizes gas-phase ion-molecule reactions at atmospheric pressure (10^5^ Pa) to convert molecules separated by GC into ions which can be analyzed by mass spectrometry.

### Mass spectrometry (MS)

Mass spectrometry is an analytical method to separate ions due to their mass, more specifically due to their difference in mass to charge ratio.

### Mass Isotopologue Distributions (MID)

Compounds which have the same sum formula are termed isomers. Isomers can be structurally different, *i.e.* the arrangement of the atoms in the molecules are different, or structurally identical. All isomers of a specific sum formula which are structurally identical but differ in their isotopic composition are termed isotopomers. The specific case of isotopomers with identical isotopic composition but differences of the position of these isotopes are termed isotopologues.

Let us look at an example: C~2~H~6~O would be a sum formula, with ethanol (CH~3~-CH~2~-OH) and dimethyl ether (CH~3~-O-CH~3~) being two structural isomers of C~2~H~6~O. The molecules ^13^CH~3~-CH~2~-OH and CH~3~-CH~2~-OH would be two isotopomers of ethanol as they differ in their isotopic composition. The molecules ^13^CH~3~-CH~2~-OH and CH~3~-^13^CH~2~-OH would be isotopologues of ethanol. Together, they represent the M1 isotopologue of ethanol with the common sum formula ^13^C^12^CH~6~O. The corresponding M2 isotopologue would be ^13^C~2~H~6~O. In mass spectrometry we measure the mass isotopologue distribution of C~2~H~6~O, more specifically of the structural isomer ethanol because we can separate other structural isomers during the chromatography preceding the measurement. Hence, the mass isotopologue distribution (MID) is an intensity vector of 3 different ion masses, M+0, M+1 and M+2. In this simple case of two carbons and neglecting ^17^O these three isotopologues represent only 4 different molecules. However, with increasing carbon number the situation is getting more complex. :)

### Helper functions

Within the main function `CorMID` we make use of several helper functions, *i.e.* `CountChemicalElements` and `CalcTheoreticalMDV`. The first one simply counts the digit following a certain letter in a chemical sum formula. Here, we use it to determine the number of carbon, silicon and sulfur atoms (neglecting nitrogen, as the ^15^N isotope is of low abundance). As the anticipated user will probably work on derivatized compounds, two additional letters were included to the chemical alphabet, **T** for TMS and **M** for a MEOX substitution. In consequence for compound Glucose (5 TMS, 1 MEOX) we would count:

```{r CountChemicalElements}
fml <- "C6H12O6T5M1"
CorMID::CountChemicalElements(x = fml)
```

and receive as output a named vector for all present elements. Specifying parameter *ele*, we can also count the occurence of a selection of elements.

```{r CountChemicalElements2}
CorMID::CountChemicalElements(x = fml, ele = c("C", "Si", "T", "Cl"))
```

The elements with a significant amount of natural occurring isotopes are relevant to calculate the theoretical mass distribution vector (or rather matrix respectively) of the compound. In the above example, the relevant elements are effectively carbon and silicon. 

```{r, echo=FALSE, ShowNaturalIsotopeAbundance}
structure(list(
  element = c("H", "H", "C", "C", "O", "O", "O", "Si", "Si", "Si"), 
  isotope = c("1H", "2H", "12C", "13C", "16O", "17O", "18O", "28Si", "29Si", "30Si"), 
  mass = c(1.0078, 2.0141, 12, 13.0034, 15.9949, 16.9991, 17.9992, 27.9769, 28.9765, 29.9738), 
  `abundance [%]` = c(99.99, 0.01, 98.93, 1.07, 99.76, 0.04, 0.21, 92.22, 4.69, 3.09), 
  `abund > 1%` = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
), row.names = c(NA, -10L), class = "data.frame")
```

As we have 5 TMS groups, we need to consider in total 21 carbon (6 of biological origin and 15 in TMS groups) and 5 silicon atoms in our calculations.

```{r CalcTheoreticalMDV1}
fml <- "C21Si5"
td <- CorMID::CalcTheoreticalMDV(fml = fml)
round(td, 4)
```

The first row of the matrix (M0) gives the relative amounts of all potential isotopes for C~21~Si~5~ assuming natural abundance conditions. The second row (M1) shows the relative amounts for isotopologue M1 (containing at least one ^13^C at any position). The final row (M6) shows the relative amounts when all biological carbon atoms are assumed to be ^13^C. The amount of biological carbon is estimated based on the amount of Si within the function (21 - 5 x 3 = 6). This might be overwritten by function parameters specifying the number of C of biological origin *nbio* and the number of measured ion signals above the detection limit *nmz*:

```{r CalcTheoreticalMDV2}
round(CorMID::CalcTheoreticalMDV(fml = fml, nbio = 21, nmz = 21)[-(5:19), -(5:19)], 4)
```

Further, the package contains the convenience function `recMID` to reconstruct a measured MID based on a given *corMID,* *r* and sum formula. `recMID` returns an object of the similarly named class to allow easy visualization by a class dependent plot method.

```{r recMID}
fml <- "C9H20O3Si2"
mid <- c(0.9, 0, 0, 0.1)
r <- list("M+H" = 0.8, "M-H" = 0.1, "M+H2O-CH4" = 0.1)
rMID <- CorMID::recMID(mid = mid, r = r, fml = fml)
round(rMID, 4)
plot(rMID)
```

The spectrum shown in the above plot would be measured for lactic acid (2 TMS), assuming 10% of the fully labeled isotopologue M3, natural abundance, 10% proton loss and 10% of [M+H]+H~2~O-CH~4~. It is the task of `CorMID` to disentangle this overlay of superimposed MIDs and estimate both, *corMID* and *r*.

## Main function `CorMID`

### Idea

The problem in GC-APCI-MS that we try to overcome is the formation of fragments forming superimposed MIDs. The ones we identified so far are [M+H], [M+], [M+H]-H~2~ and [M+H]+H~2~O-CH~4~. If we assume [M+H] to be generally the most abundant and hence use it as our fix point (base MID, shift = 0), than we observe superimposed MIDs starting at -2, -1 and +2 relative to [M+H] for [M+], [M+H]-H~2~ and [M+H]+H~2~O-CH~4~ respectively.

The basic idea of the correction is that we measure a superimposed/composite MID of one to several fragments all derived from the same base MID. This base MID (or correct MID, *corMID*) is exactly what we are looking for. Estimating the *corMID* is complicated because we do not know the distribution of fragments, *i.e.* the amount of the individually occurring fragments or their ratios to each other respectively. Hence, we have to estimate the *corMID* and the ratio vector *r* which in combination fit our measurement best.

### Example

Lets start with an artificial Glucose spectrum where 10% is M6 labeled:

```{r CorMID1}
fml <- "C21Si5"
td1 <- CorMID::CalcTheoreticalMDV(fml = fml, nbio = 6, nmz = 8)
bMID <- c(0.9, rep(0, 5), 0.1)
md1 <- apply(td1*bMID, 2, sum)
round(md1, 4)
```

**md1** represents the measured isotopologue distribution which is equivalent to the vector of measured intensity values normalized to the vector sum. Please note that the measured MID contains additional peaks at M+7 and M+8, caused by the natural abundant isotopes of carbon atoms attached during derivatization. Now we may use function `CorMID` to disentangle this vector.

```{r CorMID2}
CorMID::CorMID(int=md1, fml=fml, r="M+H")
```

Notice, that we allowed only [M+H] to be present in option *r*. The result is a labeled vector representing the corrected MID (or base MID) and attributes providing information on the fitting error *err* and the parameters *r*atio, *ratio_status* and *mid_status* as used in the function call. Please note that during the function call *mid* was estimated and *r*atio was fixed.

We could achieve something similar testing for all currently defined fragments by omitting the *r* option:

```{r CorMID3}
CorMID::CorMID(int=md1, fml=fml)
```

Here, we essentially get the same result as before (except for *ratio* related attributes) because there is no superimposition in our test data. *ratio* was estimated and other possible adducts were tested but found to be of zero presence.  Now lets generate more difficult composite data **md2** to be fit by including a 20% proton loss (or "[M+]" or "M-1", respectively) on top of **md1**.

```{r CorMID4}
md2 <- unlist(list("M-1" = 0, 0.8*md1)) + c(0.2*md1, 0)
round(md2, 4)
```

We could have done the same with the convenience function *recMID*:

```{r CorMID4altern}
fml <- "C21Si5"
bMID <- c(0.9, rep(0, 5), 0.1)
r <- list("M+H" = 0.8, "M+" = 0.2)
rMID <- CorMID::recMID(mid = bMID, r = r, fml = fml)
round(rMID, 4)
plot(rMID, ylim=c(0, 0.45))
```

and let `CorMID` decompose this back...

```{r CorMID5}
CorMID::CorMID(int=md2, fml=fml)
```

which is pretty close to the truth. :)

### More Function Details

Finally, let's look into the mathematical details of the function. Apart from some sanity checks and data preparation steps done by the wrapper function `CorMID`, the main idea is to model a theoretical distribution based on a provided sum formula and fit a base MID and fragment ratios according to measurement data by function `FitMID` which is discussed in the following. The approach is brute force using two nested estimators for *r* and *corMID* separately. It builds on the idea to test a crude grid of parameters first, identify the best solution and use an iterative method minimizing the grid to approach the true value.

The grid is set by an internal function `poss_local`. Basically, if we have a two carbon molecule we expect a *corMID* of length=3 {M0, M1, M2}. Let's assume that *corMID* = {0.9, 0, 0.1}. Using a wide grid (step size d= 0.5) we would than test the following possibilities:

```{r poss_local_demo1}
CorMID:::poss_local(vec=c(0.5,0.5,0.5), d=0.5, length.out=3)
```

and identify {1, 0, 0} as best match after subjecting to a testing function. Taking the best match as our new starting point, we decrease the step size of the grid by 50% and test in the next iteration:

```{r poss_local_demo2}
CorMID:::poss_local(vec=c(1,0,0), d=0.25, length.out=3)
```

and will get closer to the truth and find {0.875, 0, 0.125} to give the lowest error.

In summary, using this approach we can approximate the optimal vectors of *corMID* and *r* in a finite number of iterations to reach a desired precision \<0.1%. We can nest MID fitting inside ratio fitting and thereby do both in parallel.

The error function currently employed is simply the square root of the summed squared errors, comparing the provided measurement data and a reconstructed MID based on a specific *corMID* and *r*.

## Wrap up

Thanks for following till the end. I hope this framework is helpful for you.