---
name: MachineLearning
topic: Machine Learning & Statistical Learning
maintainer: Torsten Hothorn
email: Torsten.Hothorn@R-project.org
version: 2023-07-20
source: https://github.com/cran-task-views/MachineLearning/
---

Several add-on packages implement ideas and methods developed at the borderline between computer science and statistics - this field of research is usually referred to as machine learning. The packages can be roughly structured into the following topics:

- *Neural Networks and Deep Learning* : Single-hidden-layer neural networks are implemented in package `r pkg("nnet", priority = "core")` (shipped with base R). Package `r pkg("RSNNS")` offers an interface to the Stuttgart Neural Network Simulator (SNNS). Packages implementing deep learning flavours of neural networks include `r pkg("deepnet")` (feed-forward neural network, restricted Boltzmann machine, deep belief network, stacked autoencoders), `r pkg("RcppDL")` (denoising autoencoder, stacked denoising autoencoder, restricted Boltzmann machine, deep belief network) and `r pkg("h2o")` (feed-forward neural network, deep autoencoders). An interface to [tensorflow](http://www.tensorflow.org) is available in `r pkg("tensorflow")`. The `r pkg("torch")` package implements an interface to the [libtorch library](https://pytorch.org/). Prediction uncertainty can be quantified by the ENNreg evidential regression neural network model implemented in `r pkg("evreg")`.
- *Recursive Partitioning* : Tree-structured models for regression, classification and survival analysis, following the ideas in the CART book, are implemented in `r pkg("rpart", priority = "core")` (shipped with base R) and `r pkg("tree")`. Package `r pkg("rpart")` is recommended for computing CART-like trees. A rich toolbox of partitioning algorithms is available in [Weka](http://www.cs.waikato.ac.nz/~ml/weka/); package `r pkg("RWeka")` provides an interface to this implementation, including the J4.8-variant of C4.5 and M5.
  The `r pkg("Cubist")` package fits rule-based models (similar to trees) with linear regression models in the terminal leaves, instance-based corrections and boosting. The `r pkg("C50")` package can fit C5.0 classification trees, rule-based models, and boosted versions of these. `r pkg("pre")` can fit rule-based models for a wider range of response variable types.\
  Two recursive partitioning algorithms with unbiased variable selection and a statistical stopping criterion are implemented in packages `r pkg("party")` and `r pkg("partykit")`. Function `ctree()` is based on non-parametric conditional inference procedures for testing independence between the response and each input variable, whereas `mob()` can be used to partition parametric models. Extensible tools for visualizing binary trees and node distributions of the response are available in packages `r pkg("party")` and `r pkg("partykit")` as well. Partitioning of mixed-effects models (GLMMs) can be performed with package `r pkg("glmertree")`; partitioning of structural equation models (SEMs) can be performed with package `r pkg("semtree")`.\
  Graphical tools for the visualization of trees are available in package `r pkg("maptree")`.\
  Partitioning of mixture models is performed by `r pkg("RPMM")`.\
  Computational infrastructure for representing trees and unified methods for prediction and visualization is implemented in `r pkg("partykit")`. This infrastructure is used by package `r pkg("evtree")` to implement evolutionary learning of globally optimal trees. Survival trees are available in various packages. Trees for subgroup identification with respect to heterogeneous treatment effects are available in packages `r pkg("partykit")`, `r pkg("model4you")`, `r pkg("dipm")`, `r pkg("quint")`, `r pkg("SIDES")`, `r pkg("psica")`, and `r pkg("MrSGUIDE")` (and probably many more).
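
As a minimal sketch of the CART-like trees described under *Recursive Partitioning*, the following fits a classification tree with `rpart()` to the built-in `iris` data. The choice of data set, the default control settings and the in-sample accuracy check are illustrative assumptions, not recommendations of this task view.

```r
library(rpart)

# Fit a classification tree; method = "class" requests a classification
# (rather than regression) tree for the factor response Species.
tree_fit <- rpart(Species ~ ., data = iris, method = "class")

# Text representation of the fitted splits
print(tree_fit)

# Predicted class labels for the training data
pred <- predict(tree_fit, newdata = iris, type = "class")

# In-sample accuracy (optimistic; use resampling for an honest estimate)
acc <- mean(pred == iris$Species)
acc
```

Calling `plot(tree_fit)` followed by `text(tree_fit)` draws the tree with base graphics.
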
- *Random Forests* : The reference implementation of the random forest algorithm for regression and classification is available in package `r pkg("randomForest", priority = "core")`. Package `r pkg("ipred")` has bagging for regression, classification and survival analysis as well as bundling, a combination of multiple models via ensemble learning. In addition, a random forest variant for response variables measured at arbitrary scales based on conditional inference trees is implemented in package `r pkg("party")`. `r pkg("randomForestSRC")` implements a unified treatment of Breiman's random forests for survival, regression and classification problems. Quantile regression forests (`r pkg("quantregForest")`) regress quantiles of a numeric response on explanatory variables via a random forest approach. For binary data, the `r pkg("varSelRF")` and `r pkg("Boruta")` packages focus on variable selection by means of random forest algorithms. In addition, packages `r pkg("ranger")` and `r pkg("Rborist")` offer R interfaces to fast C++ implementations of random forests. Reinforcement Learning Trees, featuring splits in variables which will be important down the tree, are implemented in package `r pkg("RLT")`. `r pkg("wsrf")` implements an alternative variable weighting method for variable subspace selection in place of the traditional random variable sampling. Package `r pkg("RGF")` is an interface to a Python implementation of a procedure called regularized greedy forests. Random forests for parametric models, including forests for the estimation of predictive distributions, are available in packages `r pkg("trtf")` (predictive transformation forests, possibly under censoring and truncation) and `r pkg("grf")` (an implementation of generalised random forests).
- *Regularized and Shrinkage Methods* : Regression models with some constraint on the parameter estimates can be fitted with the `r pkg("lars")` package.
  Lasso with simultaneous updates for groups of parameters (groupwise lasso) is available in package `r pkg("grplasso")`; the `r pkg("grpreg")` package implements a number of other group penalization models, such as group MCP and group SCAD. The L1 regularization path for generalized linear models and Cox models can be obtained from functions available in package `r pkg("glmpath")`; the entire lasso or elastic-net regularization path (also in `r pkg("elasticnet")`) for linear regression, logistic and multinomial regression models can be obtained from package `r pkg("glmnet")`. The `r pkg("penalized")` package provides an alternative implementation of lasso (L1) and ridge (L2) penalized regression models (both GLM and Cox models). Package `r pkg("RXshrink")` can be used to generate TRACE displays that identify the extent of shrinkage with Maximum Likelihood of Minimum MSE Risk when errors are IID Normal. Semiparametric additive hazards models under lasso penalties are offered by package `r pkg("ahaz")`. Fisher's LDA projection with an optional LASSO penalty to produce sparse solutions is implemented in package `r pkg("penalizedLDA")`. The shrunken centroids classifier and utilities for gene expression analyses are implemented in package `r pkg("pamr")`. An implementation of multivariate adaptive regression splines is available in package `r pkg("earth")`. Various forms of penalized discriminant analysis are implemented in packages `r pkg("hda")` and `r pkg("sda")`. Package `r pkg("LiblineaR")` offers an interface to the LIBLINEAR library. The `r pkg("ncvreg")` package fits linear and logistic regression models under the SCAD and MCP regression penalties using a coordinate descent algorithm. The same penalties are also implemented in the `r pkg("picasso")` package.
  The Lasso under non-Gaussian and heteroscedastic errors is estimated by `r pkg("hdm")`; the package also provides inference on low-dimensional components of Lasso regression and on estimated treatment effects in a high-dimensional setting. Package `r pkg("SIS")` implements sure independence screening in generalised linear and Cox models. Elastic nets for correlated outcomes are available from package `r pkg("joinet")`. Robust penalized generalized linear models and robust support vector machines are fitted by package `r pkg("mpath")` using composite optimization by conjugation operator. The `r pkg("islasso")` package provides an implementation of lasso based on the induced smoothing idea, which makes it possible to obtain reliable p-values for all model parameters. Best-subset selection for linear, logistic, Cox and other regression models, based on a fast polynomial time algorithm, is available from package `r pkg("abess", priority = "core")`.
- *Boosting and Gradient Descent* : Various forms of gradient boosting are implemented in package `r pkg("gbm", priority = "core")` (tree-based functional gradient descent boosting). Packages `r pkg("lightgbm")` and `r pkg("xgboost")` implement tree-based boosting using efficient trees as base learners for several built-in as well as user-defined objective functions. The hinge loss is optimized by the boosting implementation in package `r pkg("bst")`. An extensible boosting framework for generalized linear, additive and nonparametric models is available in package `r pkg("mboost", priority = "core")`. Likelihood-based boosting for mixed models is implemented in `r pkg("GMMBoost")`. GAMLSS models can be fitted using boosting by `r pkg("gamboostLSS")`. `r pkg("adabag")` implements the classical AdaBoost algorithm with added functionality, such as variable importances.
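
To illustrate the idea behind tree-based functional gradient descent boosting, the sketch below boosts regression stumps fitted with `rpart()` under squared-error loss, where the negative gradient is simply the current residual. It is a hand-rolled illustration, not the algorithm as implemented in `gbm`, `xgboost` or `lightgbm`; the simulated data, the shrinkage `nu` and the number of iterations `M` are arbitrary assumptions.

```r
library(rpart)

set.seed(1)
n   <- 200
x   <- runif(n, 0, 2 * pi)
dat <- data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.2))

nu <- 0.1   # shrinkage (learning rate), assumed value
M  <- 100   # number of boosting iterations, assumed value

# Start from the constant fit that minimises squared error
fit    <- rep(mean(dat$y), n)
stumps <- vector("list", M)

for (m in seq_len(M)) {
  # Negative gradient of 0.5 * (y - f)^2 with respect to f: the residuals
  dat$res <- dat$y - fit
  # Base learner: a stump (tree of depth 1) fitted to the residuals
  stumps[[m]] <- rpart(res ~ x, data = dat,
                       control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  # Take a small step in the direction of the fitted base learner
  fit <- fit + nu * predict(stumps[[m]], newdata = dat)
}

mse_start <- mean((dat$y - mean(dat$y))^2)  # error of the constant fit
mse_boost <- mean((dat$y - fit)^2)          # in-sample error after boosting
c(start = mse_start, boosted = mse_boost)
```

In practice the number of iterations would be chosen by resampling rather than fixed in advance, since the in-sample error decreases with every step.
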
- *Support Vector Machines and Kernel Methods* : The function `svm()` from `r pkg("e1071", priority = "core")` offers an interface to the LIBSVM library and package `r pkg("kernlab", priority = "core")` implements a flexible framework for kernel learning (including SVMs, RVMs and other kernel learning algorithms). An interface to the SVMlight implementation (only for one-against-all classification) is provided in package `r pkg("klaR")`.
- *Bayesian Methods* : Bayesian Additive Regression Trees (BART), where the final model is defined in terms of the sum over many weak learners (not unlike ensemble methods), are implemented in packages `r pkg("BayesTree")`, `r pkg("BART")`, and `r pkg("bartMachine")`. Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian processes including Bayesian CART and treed linear models are made available by package `r pkg("tgp")`. Bayesian structure learning in undirected graphical models for multivariate continuous, discrete, and mixed data is implemented in package `r pkg("BDgraph")`; corresponding methods relying on spike-and-slab priors are available from package `r pkg("ssgraph")`. Naive Bayes classifiers are available in `r pkg("naivebayes")`.
- *Optimization using Genetic Algorithms* : Package `r pkg("rgenoud")` offers optimization routines based on genetic algorithms. The package `r pkg("Rmalschains")` implements memetic algorithms with local search chains, which are a special type of evolutionary algorithms, combining a steady state genetic algorithm with local search for real-valued parameter optimization.
- *Association Rules* : Package `r pkg("arules")` provides both data structures for efficient handling of sparse binary data as well as interfaces to implementations of Apriori and Eclat for mining frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
  Package `r pkg("opusminer")` provides an interface to the OPUS Miner algorithm (implemented in C++) for finding the key associations in transaction data efficiently, in the form of self-sufficient itemsets, using either leverage or lift.
- *Fuzzy Rule-based Systems* : Package `r pkg("frbs")` implements a host of standard methods for learning fuzzy rule-based systems from data for regression and classification. Package `r pkg("RoughSets")` provides comprehensive implementations of the rough set theory (RST) and the fuzzy rough set theory (FRST) in a single package.
- *Model selection and validation* : Package `r pkg("e1071")` has function `tune()` for hyperparameter tuning and function `errorest()` (`r pkg("ipred")`) can be used for error rate estimation. The cost parameter C for support vector machines can be chosen utilizing the functionality of package `r pkg("svmpath")`. Data splitting for cross-validation and other resampling schemes is available in the `r pkg("splitTools")` package. Package `r pkg("nestedcv")` provides nested cross-validation for `r pkg("glmnet")` and `r pkg("caret")` models. Functions for ROC analysis and other visualisation techniques for comparing candidate classifiers are available from package `r pkg("ROCR")`. Packages `r pkg("hdi")` and `r pkg("stabs")` implement stability selection for a range of models; `r pkg("hdi")` also offers other inference procedures in high-dimensional models.
- *Causal Machine Learning* : The package `r pkg("DoubleML")` is an object-oriented implementation of the double machine learning framework in a variety of causal models. Building upon the `r pkg("mlr3")` ecosystem, estimation of causal effects can be based on an extensive collection of machine learning methods.
- *Other procedures* : Evidential classifiers quantify the uncertainty about the class of a test pattern using a Dempster-Shafer mass function in package `r pkg("evclass")`.
  The `r pkg("OneR")` (One Rule) package offers a classification algorithm with enhancements for sophisticated handling of missing values and numeric data together with extensive diagnostic functions.
- *Meta packages* : Package `r pkg("tidymodels")` provides miscellaneous functions for building predictive models, including parameter tuning and variable importance measures. In a similar spirit, package `r pkg("mlr3")` offers high-level interfaces to various statistical and machine learning packages. Package `r pkg("SuperLearner")` implements a similar toolbox. The `r pkg("h2o")` package implements a general purpose machine learning platform that has scalable implementations of many popular algorithms such as random forest, GBM, GLM (with elastic net regularization), and deep learning (feedforward multilayer networks), among others. An interface to the mlpack C++ library is available from package `r pkg("mlpack")`. `r pkg("CORElearn")` implements a rather broad class of machine learning algorithms, such as nearest neighbors, trees, random forests, and several feature selection methods. Similarly, package `r pkg("rminer")` interfaces several learning algorithms implemented in other packages and computes several performance measures. Package `r pkg("qeML")` provides wrappers to numerous ML R packages with a simple, convenient, and uniform interface.
- *Visualisation (initially contributed by Brandon Greenwell)* : The `stats::termplot()` function can be used to plot the terms in a model whose predict method supports `type="terms"`. The `r pkg("effects")` package provides graphical and tabular effect displays for models with a linear predictor (e.g., linear and generalized linear models). Friedman's partial dependence plots (PDPs), which are low-dimensional graphical renderings of the prediction function, are implemented in a few packages.
  `r pkg("gbm")`, `r pkg("randomForest")` and `r pkg("randomForestSRC")` provide their own functions for displaying PDPs, but are limited to the models fit with those packages (the function `partialPlot()` from `r pkg("randomForest")` is more limited since it only allows for one predictor at a time). Packages `r pkg("pdp")`, `r pkg("plotmo")`, and `r pkg("ICEbox")` are more general and allow for the creation of PDPs for a wide variety of machine learning models (e.g., random forests, support vector machines, etc.); both `r pkg("pdp")` and `r pkg("plotmo")` support multivariate displays (`r pkg("plotmo")` is limited to two predictors while `r pkg("pdp")` uses trellis graphics to display PDPs involving three predictors). By default, `r pkg("plotmo")` fixes the background variables at their medians (or first level for factors), which is faster than constructing PDPs but incorporates less information. `r pkg("ICEbox")` focuses on constructing individual conditional expectation (ICE) curves, a refinement over Friedman's PDPs. ICE curves, as well as centered ICE curves, can also be constructed with the `partial()` function from the `r pkg("pdp")` package.
- *XAI* : Most packages and functions from the previous section "Visualisation" belong to the field of explainable artificial intelligence (XAI). The meta packages `r pkg("DALEX")` and `r pkg("iml")` offer different methods to interpret any model, including partial dependence, accumulated local effects, and permutation importance. Accumulated local effects plots are also directly available in `r pkg("ALEPlot")`. SHAP (from *SH*apley *A*dditive ex*P*lanations) is one of the most frequently used techniques to interpret ML models. It decomposes - in a fair way - predictions into additive contributions of the predictors. For tree-based models, the very fast TreeSHAP algorithm exists. It is shipped directly with `r pkg("h2o")`, `r pkg("xgboost")`, and `r pkg("lightgbm")`.
  Model-agnostic implementations of SHAP are available in additional packages: `r pkg("fastshap")` mainly uses Monte-Carlo sampling to approximate SHAP values, while `r pkg("shapr")` and `r pkg("kernelshap")` provide implementations of KernelSHAP. SHAP values from any of these packages can be plotted by the package `r pkg("shapviz")`. A port of Python's "shap" package is provided in `r pkg("shapper")`. Alternative decompositions of predictions are implemented in `r pkg("lime")` and `r pkg("iBreakDown")`.

### Links

- [MLOSS: Machine Learning Open Source Software](http://www.MLOSS.org/)