<- TDA::torusUnif(n = 200, a = 1, c = 2)
x <- TDA::alphaComplexDiag(x, maxdimension = 2)
pd plot(pd$diagram, asp = 1)
The Pitch
Topological data analysis (TDA) is a pretty mature discipline at this point, but in the last several years its assimilation into machine learning (ML) has really taken off. Based on our experience, the plurality of experimental TDA tools are written in Python, and naturally Python is home to most of these applications.
That’s not to say that there are no R packages for TDA-ML. {TDAkit}, {TDApplied}, and others provide tools for specific self-contained analyses and could be used together in larger projects. As with the broader R ecosystem, though, their integration can require some additional work. The combination of several low-level libraries and compounding package dependencies has also made this toolkit fragile, with several packages temporarily or permanently archived.
Meanwhile, the Tidymodels package collection has enabled a new generation of users, myself (JCB) included, to build familiarity and proficiency with conventional ML. By harmonizing syntax and smoothing pipelines, Tidymodels makes it quick and easy to adapt usable code to new data types, pre-processing steps, and model families. By using wrappers and extractors, it also allows seasoned users to extend their work beyond its sphere of convenience.
We therefore think that Tidymodels is an ideal starting point for a more sustained and interoperable collection for TDA-ML in R. Since much of the role of TDA in ML has been to extract and vectorize features from spatial, image, and other high-dimensional data, we present an extension to {recipes} for just this purpose. Assembling a comprehensive, general-purpose toolkit is a long-term project. Our contribution is meant to spur that project on.
The Packages
We present two packages: {TDAvec}, an interface to several efficient feature vectorizations, and {tdarec}, a {recipes} + {dials} extension that integrates these vectorizations into Tidymodels. First, though, we need to introduce persistent homology and how it can be computed with R.
Computations of persistent homology
In predictive modeling, persistent homology (PH) is the workhorse of TDA.1 In a nutshell, it measures topological patterns in data—most commonly clusters and enclosures—by the range of resolutions through which they persist. Several packages interface to lower-level libraries that compute PH for various data structures—all accept point clouds and distance matrices, but other structures are accepted where noted:
- {TDA} (Fasy et al. 2022) interfaces to the Dionysus, PHAT, and GUDHI libraries and also accepts functions and rasters;
- {ripserr} (Wadhwa et al. 2025), a spinoff from {TDAstats}, interfaces to the Ripser and Cubical Ripser C++ libraries and also accepts rasters;
- {TDApplied} (Brown and Farivar 2024) calls {TDA} and {TDAstats} but also interfaces with Python Ripser; and
- {rgudhi} (Stamm 2023) interfaces to Python GUDHI.
Here is an example using the most mature package, {TDA}, to compute the PH for an alpha filtration of a sample from a torus:
We rely for now on {ripserr} to compute PH, but a near-term upgrade will incorporate {TDA} as well.
Vectorizations of persistent homology
Vectorization is a crucial step to bridge TDA and ML. Numerous methods have been proposed, and research shows that ML performance on a specific task can depend strongly on the chosen method. Therefore, it is highly desirable to compare several approaches to a given problem and select the one found to be best-suited to it.
Although most proposed vectorization methods are available in free software, they are scattered across various R and Python packages. This complicates the comparison process, as researchers must search for available methods and adapt their code to the interface of each specific implementation. The goal of {TDAvec} (Luchinsky and Islambekov 2025) is to address this issue.
We (AL and UI) have consolidated all currently available vectorizations (of which we are aware) and implemented them in a single R library. Some of these vectorizations are parameter-free, but others rely on one or several hyperparameters. For ease of comparison, we also use a consistent interface, with a common camelcase naming convention, e.g. computePersistenceLandscape()
. Additionally, several new vectorization methods developed by our group are also available within the TDAvec framework.
Here, for example, is how to compute the tropical coordinates proposed by Kališnik (2019) and a coarse vectorization of our own persistence block transformation (Chan et al. 2022) for the degree-1 layer of the persistence diagram above:
library(TDAvec)
computeTropicalCoordinates(pd$diagram, homDim = 1)
F1 F2 F3 F4 F5 F6
0.8904834 1.7756440 2.2856840 2.7673257 10.1883696 7.0749433
F7
210.0180135
<- seq(0, 1.5, .5)
xy_seq computePersistenceBlock(
$diagram, homDim = 2,
pdxSeq = xy_seq, ySeq = xy_seq
)
[1] 0.3539504 0.2527690 0.0000000 0.2708638 51.0522531 62.6966904 0.0000000
[8] 66.3569136 84.8988826
Notably, all vectorizations are written in C++ and exposed to R using {Rcpp}. We also utilize the Armadillo package for various matrix operations, which makes all computations extremely fast and efficient.
Tidy machine learning with persistent homology
{recipes} is the pre-processing arm of Tidymodels; mostly it provides step_*()
functions that pipe together as recipe specifications, to later be applied directly to data or built into workflows. {dials} provides tuning functions for these steps as well as for model families provided by {parsnip}.
{tdarec} provides two primary families of steps:
- Steps to calculate persistent homology from data, which share the naming pattern
step_pd_*()
for persistence diagram. These steps rely on the engines above and are scoped according to the underlying mathematical object encoded in the data: Currently the {ripserr} engine handles point clouds (encoded as coordinate matrices or distance matrices) and rasters (encoded as numerical arrays). These steps accept list-columns of data sets and return list-columns of persistence diagrams. - Steps to transform and vectorize persistence diagrams, which share the naming pattern
step_vpd_*()
for vectorized persistence diagram. These steps rely on {TDAvec} and are scoped as there by transformation. They accept list-columns of persistence diagrams and return numeric columns (sometimes flattened from matrix output, e.g. multi-level persistence landscapes) that can be used by most predictive model families.
Here is how to incorporate PH (using the default Vietoris–Rips filtration) and the persistence block transformation above into a pre-processing recipe:
suppressMessages(library(tdarec))
<- tibble(source = "torus", sample = list(x))
dat recipe(~ sample, data = dat) |>
step_pd_point_cloud(sample, max_hom_degree = 2) |>
step_pd_degree(sample_pd, hom_degrees = 2) |>
step_vpd_persistence_block(
sample_pd_2,hom_degree = 2, xseq = xy_seq
-> rec
) |>
rec prep(training = dat) |>
bake(new_data = dat)
# A tibble: 1 × 299
sample sample_pd_2 sample_pd_2_pb_1 sample_pd_2_pb_2 sample_pd_2_pb_3
<list> <list> <dbl> <dbl> <dbl>
1 <dbl [200 × 3]> <PHom> 0 0 39.7
# ℹ 294 more variables: sample_pd_2_pb_4 <dbl>, sample_pd_2_pb_5 <dbl>,
# sample_pd_2_pb_6 <dbl>, sample_pd_2_pb_7 <dbl>, sample_pd_2_pb_8 <dbl>,
# sample_pd_2_pb_9 <dbl>, sample_pd_2_pb_10 <dbl>, sample_pd_2_pb_11 <dbl>,
# sample_pd_2_pb_12 <dbl>, sample_pd_2_pb_13 <dbl>, sample_pd_2_pb_14 <dbl>,
# sample_pd_2_pb_15 <dbl>, sample_pd_2_pb_16 <dbl>, sample_pd_2_pb_17 <dbl>,
# sample_pd_2_pb_18 <dbl>, sample_pd_2_pb_19 <dbl>, sample_pd_2_pb_20 <dbl>,
# sample_pd_2_pb_21 <dbl>, sample_pd_2_pb_22 <dbl>, …
The code chunk uses the additional step step_pd_degree()
to extract degree-specific layers from multi-degree persistence diagrams; in this case, we are interested in vectorizing only 2-dimensional features. Despite this, we must also specify the degree of the features we want in the persistence block step.
The Potential
Many methods remain to be built into these tools, which will only reach their full potential through user feedback. Two issues in particular, additional vectorizations for {TDAvec} and additional engines for {tdarec}, may remain open—or, ahem, persist—into the foreseeable future. We welcome bug reports, feature requests, and code contributions from the community!
References
Footnotes
Exploratory modeling is another story.↩︎