This document is a user-friendly manual for the Treefit package, the
first toolkit for quantitative trajectory inference using single-cell
gene expression data. In this tutorial, we demonstrate how to generate
and analyze two kinds of toy data so that you can become familiar with
the practical workflow of Treefit. After learning these basics, we will
use Treefit to perform a more biologically interesting analysis in the
next tutorial
(vignette("working-with-seurat")).
While the Treefit package has been developed to help biologists who wish to perform trajectory inference from single-cell gene expression data, Treefit can also be viewed as a toolkit to generate and analyze a point cloud in d-dimensional Euclidean space (i.e., simulated gene expression data).
Treefit provides some useful functions to generate artificial
datasets. For example, as we will now demonstrate, the function
treefit::generate_2d_n_arms_star_data() creates data that
approximately fit a star tree with a given number of arms (branches).
Here, the term star means a tree that looks like the star symbol “*”;
for example, the letters Y and X can be viewed as star trees that have
three and four arms, respectively.
The rows and columns of the generated data correspond to data points (i.e., n single cells) and their features (i.e., expression values of d different genes), respectively. The Treefit package can analyze both raw count data and normalized expression data; its toy-data generators, however, are meant to produce continuous data like normalized expression values.
Importantly, we can generate data with a desired level of noise by
changing the value of the fatness parameter of this
function. For example, if you set the fatness parameter to
0.0 then you will get precisely tree-like data without
noise. By contrast, setting the fatness to 1.0
gives very noisy data that are no longer tree-like. In this tutorial, we
deal with two types of toy data whose fatness values are
0.1 and 0.8, respectively. Treefit can also
generate and analyze high-dimensional datasets, but we focus on
2-dimensional data to keep this introductory tutorial as simple as
possible.
Let us first generate 2-dimensional tree-like data that contain 500 data points and reasonably fit a star tree with three arms. We can create such data and draw a scatter plot of them (Figure 1) simply by executing the following two lines of code:
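A sketch of these two lines is given here. It assumes that treefit::generate_2d_n_arms_star_data() takes the number of points, the number of arms and the fatness as positional arguments in that order; the variable name star.tree_like matches the object passed to treefit::treefit() later in this tutorial.

```r
# Assumption: positional arguments are (number of points, number of arms, fatness).
star.tree_like <- treefit::generate_2d_n_arms_star_data(500, 3, 0.1)
# Scatter plot of the generated 2-dimensional points.
plot(star.tree_like)
```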
Similarly, we can generate noisy data that look less tree-like than
the previous dataset. To obtain such data, we simply change the value
of the fatness parameter from 0.1 to 0.8.
The following two lines of code
yield the scatter plot shown in Figure 2:
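A sketch of these two lines is given here, under the same assumption about the positional argument order (number of points, number of arms, fatness). The variable name star.less_tree_like matches the object analyzed later in this tutorial.

```r
# Assumption: positional arguments are (number of points, number of arms, fatness).
star.less_tree_like <- treefit::generate_2d_n_arms_star_data(500, 3, 0.8)
# Scatter plot of the noisier 2-dimensional points.
plot(star.less_tree_like)
```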
Having generated two datasets that differ in tree-likeness, we can analyze each of them. The Treefit package allows us to estimate how well the data can be explained by tree models and to predict how many “principal paths” there are in the best-fit tree. As shown in Figure 1, the points in the first dataset clearly form a shape like the letter “Y”, so the data are considered to fit a star tree with three arms very well, whereas Figure 2 indicates that the second dataset is no longer very tree-like because of the high level of added noise. Our goal in this tutorial is to reproduce these conclusions by using Treefit.
Let us estimate the goodness-of-fit between tree models and the first
toy data. This can be done by using treefit::treefit() as
follows. The name parameter is optional, but we recommend
specifying it whenever possible because it makes the estimation easy
to identify later.
fit.tree_like <- treefit::treefit(list(expression=star.tree_like),
                                  name="tree-like")
# Save the analysis result for use in other tutorials.
saveRDS(fit.tree_like, "fit.tree_like.rds")
fit.tree_like
## $name
## [1] "tree-like"
## 
## $max_cca_distance
##     p       mean standard_deviation
## 1   1 0.77898409        0.211442065
## 2   2 0.08458309        0.017835176
## 3   3 0.07159228        0.009670379
## 4   4 0.06322331        0.007688571
## 5   5 0.05823659        0.005875230
## 6   6 0.05521222        0.005803452
## 7   7 0.05247373        0.006368379
## 8   8 0.04981124        0.006581161
## 9   9 0.04754964        0.006479259
## 10 10 0.04606810        0.006140129
## 11 11 0.04413324        0.006067718
## 12 12 0.04270099        0.006209621
## 13 13 0.04028669        0.005663627
## 14 14 0.03917855        0.005915761
## 15 15 0.03806571        0.005944876
## 16 16 0.03704594        0.005766601
## 17 17 0.03590067        0.005619800
## 18 18 0.03528441        0.005665494
## 19 19 0.03431457        0.005781265
## 20 20 0.03357644        0.005407635
## 
## $rms_cca_distance
##     p       mean standard_deviation
## 1   1 0.77898409         0.21144206
## 2   2 0.09929341         0.01676104
## 3   3 0.11425316         0.01486931
## 4   4 0.36109056         0.11146296
## 5   5 0.18136802         0.02426402
## 6   6 0.20691420         0.02761955
## 7   7 0.33450961         0.06406051
## 8   8 0.26822993         0.03274549
## 9   9 0.28837662         0.02927741
## 10 10 0.35474015         0.04554384
## 11 11 0.36936282         0.02165180
## 12 12 0.36588766         0.02286532
## 13 13 0.39159250         0.02562691
## 14 14 0.42006682         0.03257720
## 15 15 0.43432754         0.02612058
## 16 16 0.43173962         0.02503497
## 17 17 0.44554059         0.02724136
## 18 18 0.45769065         0.02263653
## 19 19 0.45797849         0.02543545
## 20 20 0.46462122         0.02715766
## 
## $n_principal_paths_candidates
## [1]  3  6  9 13 17
## 
## attr(,"class")
## [1] "treefit"
treefit::treefit() returns a treefit object
that summarizes the results of the Treefit analysis. We will explain
how to interpret the results in the next section. For now, we focus on
learning how to use Treefit.
As we will see later, it is helpful to visualize the results using
plot(). Executing plot(fit.tree_like) produces
two plots that make the results of the Treefit analysis easier to
interpret.
We can analyze the second toy data in the same manner.
fit.less_tree_like <- treefit::treefit(list(expression=star.less_tree_like),
                                       name="less-tree-like")
# Save the analysis result for use in other tutorials.
saveRDS(fit.less_tree_like, "fit.less_tree_like.rds")
fit.less_tree_like
## $name
## [1] "less-tree-like"
## 
## $max_cca_distance
##     p       mean standard_deviation
## 1   1 0.75532883         0.23069649
## 2   2 0.35236514         0.09143309
## 3   3 0.30003965         0.06796421
## 4   4 0.24836894         0.06388260
## 5   5 0.22213817         0.04977027
## 6   6 0.20287189         0.04913655
## 7   7 0.18547705         0.04245573
## 8   8 0.17042069         0.04119185
## 9   9 0.15001861         0.03634214
## 10 10 0.14416092         0.03519550
## 11 11 0.13308743         0.03299340
## 12 12 0.12985405         0.03275815
## 13 13 0.12264250         0.02967583
## 14 14 0.11368085         0.02506860
## 15 15 0.10967418         0.02414870
## 16 16 0.10694768         0.02371396
## 17 17 0.10109370         0.02128847
## 18 18 0.09645739         0.02043134
## 19 19 0.09394676         0.01985948
## 20 20 0.09116601         0.01810588
## 
## $rms_cca_distance
##     p      mean standard_deviation
## 1   1 0.7553288         0.23069649
## 2   2 0.4777216         0.12472563
## 3   3 0.5327589         0.09643590
## 4   4 0.5526816         0.08393091
## 5   5 0.6070735         0.06644539
## 6   6 0.6236694         0.04724462
## 7   7 0.6213101         0.06284605
## 8   8 0.6180533         0.04591068
## 9   9 0.6081902         0.05514809
## 10 10 0.6050830         0.04816870
## 11 11 0.6076264         0.05041416
## 12 12 0.6095183         0.04593843
## 13 13 0.6055770         0.03980898
## 14 14 0.6052519         0.04065391
## 15 15 0.6110217         0.04076162
## 16 16 0.6159176         0.03789853
## 17 17 0.6150088         0.03996586
## 18 18 0.6109906         0.04342939
## 19 19 0.6138017         0.04155609
## 20 20 0.6110423         0.03907852
## 
## $n_principal_paths_candidates
## [1]  3 11 15 19
## 
## attr(,"class")
## [1] "treefit"
Before interpreting the previous results, we briefly summarize the Treefit analysis, which consists of the following three steps.
First, Treefit repeatedly “perturbs” the input data (i.e., adds some small noise to the original raw count data or normalized expression data) in order to produce many slightly different datasets that could have been acquired in the same biological experiment.
Second, for each dataset, Treefit calculates a distance matrix that represents the dissimilarities between sample cells and then constructs a tree from each distance matrix. The current version of Treefit computes a minimum spanning tree (MST), a structure that has been widely used for trajectory inference.
Finally, Treefit evaluates the goodness-of-fit between the data and tree models. The underlying idea is that trees inferred from tree-like data tend to be more robust to noise than trees inferred from non-tree-like data. Therefore, Treefit measures the mutual similarity between the estimated trees in order to check the stability of the tree structures. To this end, Treefit constructs, for each tree, a p-dimensional subspace that captures the main features of the tree structure and then measures the mutual similarities between the subspaces by using a special type of metric called the Grassmann distance. In principle, when the estimated trees are mutually similar in their structure, the mean and standard deviation (SD) of the Grassmann distance are small.
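The first two steps can be illustrated with a small base R sketch. It is not Treefit's implementation: the helper names perturb_sketch() and mst_edges_sketch(), the Gaussian noise model, the strength parameter and the use of Euclidean distances are all assumptions made for illustration. The MST is built with Prim's algorithm.

```r
# Step 1 (hypothetical helper): noisy copies of a cells-by-genes matrix.
# Gaussian noise scaled by the overall SD is an assumption of this sketch.
perturb_sketch <- function(expression, n_perturbations = 3, strength = 0.01) {
  noise_sd <- strength * sd(as.vector(expression))
  lapply(seq_len(n_perturbations), function(i) {
    expression + matrix(rnorm(length(expression), sd = noise_sd),
                        nrow = nrow(expression))
  })
}

# Step 2 (hypothetical helper): edges of a minimum spanning tree,
# computed with Prim's algorithm on Euclidean distances between rows.
mst_edges_sketch <- function(points) {
  d <- as.matrix(dist(points))
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (k in seq_len(n - 1)) {
    # Cheapest edge joining a vertex in the tree to one outside it.
    sub <- d[in_tree, !in_tree, drop = FALSE]
    idx <- which(sub == min(sub), arr.ind = TRUE)[1, ]
    from <- which(in_tree)[idx[1]]
    to <- which(!in_tree)[idx[2]]
    edges[k, ] <- c(from, to)
    in_tree[to] <- TRUE
  }
  edges
}

set.seed(1)
toy <- rbind(c(0, 0), c(1, 0), c(2, 0), c(1, 1), c(1, 2))  # 5 "cells", 2 "genes"
trees <- lapply(perturb_sketch(toy), mst_edges_sketch)
length(trees)      # one tree per perturbed dataset
nrow(trees[[1]])   # a tree on 5 points has 4 edges
```

Each element of trees is a two-column matrix of row indices, one row per edge. Comparing such trees across perturbations is the purpose of the third step.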
Although the term “Grassmann distance” may sound unfamiliar to
some readers, the concept appears in different disguises in various
practical contexts. For example, the Grassmann distance has a close
connection to canonical correlation analysis (CCA). Treefit provides two
Grassmann distances, $max_cca_distance and
$rms_cca_distance, that can be used for different purposes
as we now explain.
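To make the connection concrete, the following base R sketch computes the cosines of the principal angles between two subspaces; for centered data these cosines are the canonical correlations. The two distance formulas are assumptions suggested by the names (one based on the largest cosine, one on the root mean square of the cosines) and are not confirmed by this tutorial. All function names are hypothetical.

```r
# Cosines of the principal angles between the column spaces of a and b.
principal_cosines <- function(a, b) {
  qa <- qr.Q(qr(a))                  # orthonormal basis of span(a)
  qb <- qr.Q(qr(b))                  # orthonormal basis of span(b)
  pmin(svd(crossprod(qa, qb))$d, 1)  # clamp rounding error above 1
}

# Assumed definitions, for illustration only.
max_cca_distance_sketch <- function(a, b) {
  sqrt(1 - max(principal_cosines(a, b))^2)
}
rms_cca_distance_sketch <- function(a, b) {
  sqrt(1 - mean(principal_cosines(a, b)^2))
}

a <- cbind(c(1, 0, 0), c(0, 1, 0))   # the x-y plane
b <- cbind(c(1, 0, 0), c(0, 1, 1))   # a plane tilted by 45 degrees about the x axis
max_cca_distance_sketch(a, b)        # approximately 0: the planes share the x axis
rms_cca_distance_sketch(a, b)        # approximately 0.5
```

Under these assumed definitions the first distance can never exceed the second, and both coincide when p=1.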
The Treefit analysis using the first Grassmann distance
$max_cca_distance (shown in the left panel of Figure 5)
tells us the goodness-of-fit between data and tree models. In principle,
as mentioned earlier, if the mean and SD of
$max_cca_distance are small, then this means that the
estimated trees are mutually similar in their structure. As can be
observed, the distance changes according to the dimensionality p
of the feature space, but $max_cca_distance has the
property that its value decreases monotonically as p increases
for any dataset.
Comparing the Treefit results for the two datasets, we see that for every p from 2 to 20 the mean Grassmann distance for the first dataset is far lower than that for the second one (for example, 0.085 versus 0.352 at p=2); only at p=1 are the two values comparable. Over the same range of p, the SD of the Grassmann distance for the first dataset is also very small compared to that for the second. These results imply that the estimated tree structures are very robust to noise in the first case but not in the second case. Thus, Treefit has verified that the first dataset is highly tree-like while the second is not.
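The comparison can be checked against the means of $max_cca_distance printed earlier. The sketch below copies the first five means of each dataset by hand, rounded to four decimals; with the fitted objects at hand, the same numbers are in the mean column of fit.tree_like$max_cca_distance and fit.less_tree_like$max_cca_distance.

```r
# Means of $max_cca_distance for p = 1, ..., 5, copied from the printed output.
tree_like      <- c(0.7790, 0.0846, 0.0716, 0.0632, 0.0582)
less_tree_like <- c(0.7553, 0.3524, 0.3000, 0.2484, 0.2221)

# Values of p at which the first dataset has the smaller mean distance.
lower_at <- which(tree_like < less_tree_like)
lower_at                        # 2 3 4 5

# Both sequences decrease as p increases.
all(diff(tree_like) < 0)        # TRUE
all(diff(less_tree_like) < 0)   # TRUE
```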
The Treefit analysis using the other Grassmann distance
$rms_cca_distance (shown in the right panel of Figure 5) is
useful for inferring the number of “principal paths” in the best-fit
tree. From a biological perspective, this analysis can be used to
discover a novel or unexpected cell type from single-cell gene
expression data (for details, see vignette("working-with-seurat")).
Unlike the previous Grassmann distance, the mean value of
$rms_cca_distance can fluctuate depending on the value of
p. Interestingly, we can predict the number of principal paths in
the best-fit tree by exploring for which p the distance value
reaches “the bottom of a valley” (i.e., attains a local minimum).
More precisely, when $rms_cca_distance attains a local
minimum at a certain p, the value p+1 indicates the number
of principal paths in the best-fit tree.
$n_principal_paths_candidates stores these p+1 values, so
we do not need to calculate them manually. When Treefit produces a plot
with more than one valley, the valley at the smallest p is usually
the most informative for the prediction. The corresponding p+1 value
can be obtained with $n_principal_paths_candidates[1].
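As an illustration, the candidates for the first dataset can be recovered from the printed means of $rms_cca_distance. This sketch assumes that a local minimum is an interior value of p whose mean is strictly smaller than both of its neighbours; the rule Treefit applies internally may differ in its details.

```r
# Means of $rms_cca_distance for the first dataset (p = 1, ..., 20),
# copied from the printed output.
rms_mean <- c(0.77898409, 0.09929341, 0.11425316, 0.36109056, 0.18136802,
              0.20691420, 0.33450961, 0.26822993, 0.28837662, 0.35474015,
              0.36936282, 0.36588766, 0.39159250, 0.42006682, 0.43432754,
              0.43173962, 0.44554059, 0.45769065, 0.45797849, 0.46462122)

# Interior values of p, i.e. those with a neighbour on both sides.
interior <- 2:(length(rms_mean) - 1)

# A valley is strictly lower than both neighbours.
is_valley <- rms_mean[interior] < rms_mean[interior - 1] &
  rms_mean[interior] < rms_mean[interior + 1]

# Each valley at p gives p + 1 as a candidate number of principal paths.
candidates <- interior[is_valley] + 1
candidates      # 3 6 9 13 17
candidates[1]   # 3
```

The result agrees with the $n_principal_paths_candidates printed for the first dataset.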
Comparing the Treefit results for the two datasets, we first see that both plots attain a local minimum at p=2. This means that for both datasets the best-fit tree has p+1=3 principal paths, which is correct because both were generated from the same star tree with three arms. Another important point is that the SD of the Grassmann distance for the first dataset is very small at p=2 compared to that for the second one; in other words, Treefit made this prediction more confidently for the first dataset than for the second one. This result is reasonable because the first dataset is much less noisy than the second one. Thus, Treefit has correctly determined the number of principal paths in the underlying tree together with the goodness-of-fit for each dataset.