Type: Package
Title: Data Leakage Detection Tools for Machine Learning
Version: 0.1.0
Description: Provides utilities to detect common data leakage patterns including train/test contamination, temporal leakage, and data duplication, enhancing model reliability and reproducibility in machine learning workflows. Generates diagnostic reports and visual summaries to support data validation. Methods based on best practices from Hastie, Tibshirani, and Friedman (2009, ISBN:978-0387848570).
Imports: ggplot2, arrow, data.table, digest, htmltools, openxlsx, readxl, stringr, workflows, jsonlite
Suggests: testthat (>= 3.0.0), caret, mlr3, tidymodels, knitr, rmarkdown
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.3
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2025-10-22 08:43:45 UTC; Isabella
Author: Cheryl Isabella Lim [aut, cre]
Maintainer: Cheryl Isabella Lim <cheryl.academic@gmail.com>
Repository: CRAN
Date/Publication: 2025-10-26 18:50:02 UTC

leakr: Data Leakage Detection for Machine Learning in R

Description

leakr: Data Leakage Detection for Machine Learning in R

Details

The leakr package provides tools to automatically detect common data leakage patterns in machine learning workflows for tabular data. It identifies train/test contamination, target leakage, and duplicate rows with clear diagnostic reports and visualisations.

Key Features

Main Functions

Built-in Detectors

Data Compatibility

Accepts data.frame, tibble, and data.table objects.

Quick Start

# Audit a dataset for leakage
library(leakr)
report <- leakr_audit(my_data, target = "outcome")

# View summary of issues found
leakr_summarise(report)

# Create diagnostic plots
leakr_plot(report)

Author(s)

Maintainer: Cheryl Isabella Lim <cheryl.academic@gmail.com>


Initialise built-in detectors

Description

Initialise built-in detectors

Usage

.onLoad(libname, pkgname)

Enhanced column name cleaning with better robustness

Description

Enhanced column name cleaning with better robustness

Usage

clean_column_names(names)

Arguments

names

Character vector of column names

Value

Cleaned column names
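
The kind of cleaning described here can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name clean_names_sketch and the specific rules (lower-casing, underscores, an "x" prefix for leading digits, a "col" placeholder) are assumptions chosen for the example.

```r
# Hypothetical helper; the rules below are illustrative assumptions.
clean_names_sketch <- function(names) {
  out <- tolower(trimws(as.character(names)))
  # Replace each run of non-alphanumeric characters with one underscore
  out <- gsub("[^a-z0-9]+", "_", out)
  # Drop leading and trailing underscores
  out <- gsub("^_+|_+$", "", out)
  # Give empty or missing names a placeholder
  out[is.na(out) | out == ""] <- "col"
  # Prefix names that start with a digit
  starts_digit <- grepl("^[0-9]", out)
  out[starts_digit] <- paste0("x", out[starts_digit])
  # Disambiguate duplicates by appending _1, _2, ...
  make.unique(out, sep = "_")
}

cleaned <- clean_names_sketch(c(" Customer ID ", "Total ($)", "total", "2024 sales", ""))
print(cleaned)
```

Running make.unique() last means that two columns which collapse to the same cleaned name still end up distinct.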


Enhanced report compilation with numeric severity scores

Description

This function compiles a report with enhanced sorting, severity scoring, and detailed metadata, including configuration information.

Usage

compile_report(
  results,
  audit_data,
  config,
  show_config = FALSE,
  top_n = 10,
  report = "default"
)

Arguments

results

A list containing detection results.

audit_data

The audit data used for the report.

config

Configuration settings, including whether to use numeric severity scores.

show_config

Logical, whether to display the configuration used for report generation. Defaults to FALSE.

top_n

Numeric, the number of top results to display in the report. Defaults to 10.

report

A string indicating the type of report to generate. Defaults to "default".

Value

A leakr_report object containing the summary, evidence, and metadata for the report.


Enhanced date detection handling multiple formats and data types

Description

Enhanced date detection handling multiple formats and data types

Usage

detect_and_convert_dates_enhanced(data, verbose)

Arguments

data

Input data.frame

verbose

Whether to show messages

Value

data.frame with converted dates
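
One way to approach multi-format date detection is to try a short list of formats per character column and convert only when every non-missing value parses. The sketch below assumes that strategy; the helper name convert_dates_sketch and the two candidate formats are illustrative, not taken from the package.

```r
# Hypothetical helper; the candidate formats are an assumption.
convert_dates_sketch <- function(data, formats = c("%Y-%m-%d", "%d/%m/%Y")) {
  for (col in names(data)) {
    x <- data[[col]]
    if (!is.character(x)) next
    for (fmt in formats) {
      parsed <- as.Date(x, format = fmt)
      # Convert only if the column has values and all of them parse
      if (any(!is.na(x)) && all(!is.na(parsed[!is.na(x)]))) {
        data[[col]] <- parsed
        break
      }
    }
  }
  data
}

df <- data.frame(
  when = c("2024-01-31", "2024-02-29", NA),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
out <- convert_dates_sketch(df)
str(out)
```

Requiring every value to parse keeps free-text columns such as label untouched.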


Detect file format from extension and content

Description

Detect file format from extension and content

Usage

detect_file_format(file_path, verbose = TRUE)

Arguments

file_path

Path to the file

verbose

Whether to show detection messages

Value

Character string indicating detected format
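
As a rough illustration of the extension-based half of this detection, the sketch below maps a file extension to a format label. The helper name detect_format_sketch, the extension table, and the "unknown" fallback are assumptions; content-based detection is not covered.

```r
# Hypothetical helper covering extension-based detection only.
detect_format_sketch <- function(file_path) {
  ext <- tolower(tools::file_ext(file_path))
  if (!nzchar(ext)) return("unknown")
  switch(ext,
    csv = "csv",
    tsv = "tsv",
    xls = ,
    xlsx = "excel",
    rds = "rds",
    json = "json",
    parquet = "parquet",
    "unknown"
  )
}

formats <- vapply(
  c("data/sales.CSV", "book.xlsx", "README"),
  detect_format_sketch,
  character(1)
)
print(formats)
```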


Registry-based Detector System

Description

This section of the package manages a registry for various data leakage detectors. Detectors are stored in the .detector_registry environment and are accessible by name. The system allows for easy registration of detectors, providing their descriptions and registration times. Detectors can be queried by name or listed.

Usage

.detector_registry

Format

An object of class environment of length 2.
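
An environment-backed registry of this kind can be sketched in a few lines of base R. The code is illustrative only: registry_sketch, register_sketch, list_sketch, and info_sketch are hypothetical names, and the stored fields are assumptions based on the description above (a function, a description, and a registration time).

```r
# Hypothetical stand-in for a detector registry; all names are illustrative.
registry_sketch <- new.env(parent = emptyenv())

register_sketch <- function(name, fun, description = "") {
  stopifnot(is.character(name), length(name) == 1L, is.function(fun))
  assign(
    name,
    list(fun = fun, description = description, registered_at = Sys.time()),
    envir = registry_sketch
  )
  invisible(TRUE)
}

# Names of all registered entries, in alphabetical order
list_sketch <- function() ls(registry_sketch)

# Description and registration time for one entry
info_sketch <- function(name) {
  if (!exists(name, envir = registry_sketch, inherits = FALSE)) {
    stop("No detector registered under '", name, "'")
  }
  get(name, envir = registry_sketch, inherits = FALSE)[c("description", "registered_at")]
}

register_sketch("duplicates", function(data) sum(duplicated(data)), "Counts duplicate rows")
register_sketch("row_count", function(data) nrow(data), "Counts rows")
print(list_sketch())
print(info_sketch("duplicates")$description)
```

Using an environment rather than a list gives reference semantics, so registrations made inside functions persist without reassigning a global object.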


Determine risk level and CSS class from severity counts.

Description

Determine risk level and CSS class from severity counts.

Usage

determine_risk_level(severity_counts)

Arguments

severity_counts

Named integer vector of severity frequencies.

Value

List with 'level' and CSS 'class'.
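
A mapping from severity counts to a risk level might look like the sketch below. The severity labels ("critical", "high", "medium", "low"), the "none" fallback, and the "risk-" CSS prefix are all assumptions made for illustration; the package may use different labels and rules.

```r
# Hypothetical helper; labels and CSS class names are assumptions.
risk_level_sketch <- function(severity_counts) {
  ordered_levels <- c("critical", "high", "medium", "low")
  seen <- names(severity_counts)[severity_counts > 0]
  present <- ordered_levels[ordered_levels %in% seen]
  # The most serious severity with a non-zero count decides the level
  level <- if (length(present) > 0) present[1] else "none"
  list(level = level, class = paste0("risk-", level))
}

risk <- risk_level_sketch(c(low = 3L, high = 1L, critical = 0L))
print(risk)
```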


Helper function to return an empty snapshot info dataframe

Description

Helper function to return an empty snapshot info dataframe

Usage

empty_snapshot_info()

Value

Empty data.frame with correct structure


Export data with consistent messaging

Description

Export data with consistent messaging

Usage

export_data_internal(data, file_path, format, verbose, ...)

Arguments

data

Data.frame to export

file_path

Output file path

format

Output format

verbose

Whether to show messages

...

Additional arguments passed on to the format-specific export routine.

Value

Path to exported file


Format detector names for display.

Description

Format detector names by converting them to title case and separating words by spaces.

Usage

format_detector_name(detector_name)

Arguments

detector_name

A string to format, typically a detector name with underscores.

Value

A title-cased, space-separated string.
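
The transformation described above can be sketched with two gsub() calls. The helper name format_name_sketch is hypothetical and this is not necessarily how the package implements it.

```r
# Hypothetical helper: underscores to spaces, then capitalise each word.
format_name_sketch <- function(detector_name) {
  words <- gsub("_+", " ", detector_name)
  gsub("(^|\\s)([a-z])", "\\1\\U\\2", words, perl = TRUE)
}

label <- format_name_sketch("train_test_contamination")
print(label)
```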


Generate diagnostic plots for a leakr_report

Description

Generate diagnostic plots for a leakr_report

Usage

generate_diagnostic_plots(report)

Arguments

report

A leakr_report object, as returned by leakr_audit().

Value

A named list of ggplot objects (currently empty stub)


Generate evidence section with format-specific handling and DRY logic.

Description

Generate evidence section with format-specific handling and DRY logic.

Usage

generate_evidence_section(report, format)

Arguments

report

A leakr_report object.

format

Character string naming the output format for the section.

Value

Formatted evidence section string.


Report generator

Description

Generate an executive summary text for the leakage audit report.

Usage

generate_executive_summary_text(report)

Arguments

report

A 'leakr_report' object containing summarized issues.

Value

Formatted summary string (Markdown/HTML-friendly).


Generate detailed issues section with output formatting and truncation.

Description

Generate detailed issues section with output formatting and truncation.

Usage

generate_issues_section(report, format)

Arguments

report

A leakr_report object.

format

Character string naming the output format for the section.

Value

Formatted issues section string.


Generate actionable recommendations based on report findings.

Description

This function generates actionable recommendations based on the findings in a leakr_report object.

Usage

generate_recommendations(report)

Arguments

report

A leakr_report object containing the summary of issues and metadata.

Value

A character vector of recommendations.

Examples

## Not run: 
# Requires a leakr_report object
report <- leakr_audit(iris, target = "Species")
recommendations <- generate_recommendations(report)

## End(Not run)


Format recommendations for output.

Description

Format recommendations for output.

Usage

generate_recommendations_section(report, format)

Arguments

report

A leakr_report object.

format

Character string naming the output format for the section.

Value

Formatted recommendation section string.


Get detector information

Description

Retrieves information about detectors, optionally filtering by the detector name.

Usage

get_detector_info(name = NULL)

Arguments

name

Optional detector name. If NULL, returns info for all detectors.

Value

A list with detector information, including description and registration date.

Examples

# Get information for all detectors
get_detector_info()

# Get information for a single detector (the name must already be registered)
get_detector_info("file_format")


Null-coalescing operator for clean default value handling

Description

Null-coalescing operator for clean default value handling

Usage

x %||% y

Arguments

x

First value to check

y

Fallback value if x is NULL

Value

x if not NULL, otherwise y
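
A null-coalescing operator of this shape is a one-liner in R. The definition below is a common idiom and is shown as a sketch; the config list and its fields are made up for the example.

```r
# Common idiom: return x unless it is NULL, in which case return y.
`%||%` <- function(x, y) if (is.null(x)) y else x

# Illustrative configuration list with one field set and one missing
config <- list(top_n = 5)
top_n <- config$top_n %||% 10
sample_size <- config$sample_size %||% 50000
print(c(top_n = top_n, sample_size = sample_size))
```

Only NULL triggers the fallback; values such as NA, 0, or "" are returned unchanged.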


Import CSV files with robust parsing

Description

Import CSV files with robust parsing

Usage

import_csv(file_path, encoding, verbose, ...)

Arguments

file_path

Path to CSV file

encoding

Character encoding

verbose

Whether to show messages

...

Additional arguments passed on to the underlying reader.

Value

data.frame


Import Excel files with enhanced sheet support

Description

Import Excel files with enhanced sheet support

Usage

import_excel(file_path, sheet, verbose, ...)

Arguments

file_path

Path to Excel file

sheet

Sheet name or number

verbose

Whether to show messages

...

Additional arguments passed on to the underlying reader.

Value

data.frame


Import JSON files with better structure handling

Description

Import and process JSON files, converting them into a standardized data.frame.

Usage

import_json(file_path, verbose = FALSE, ...)

Arguments

file_path

Path to the JSON file.

verbose

Logical flag indicating whether to show progress messages (default is FALSE).

...

Additional arguments passed to jsonlite::fromJSON().

Value

A data.frame with the content from the JSON file, flattened.


Import Parquet files

Description

Import and process Parquet files into a standardized data.frame.

Usage

import_parquet(file_path, verbose = FALSE, ...)

Arguments

file_path

Path to the Parquet file.

verbose

Logical flag indicating whether to show progress messages (default is FALSE).

...

Additional arguments passed to arrow::read_parquet().

Value

A data.frame with the content from the Parquet file.


Import RDS files with validation

Description

Import RDS files with validation

Usage

import_rds(file_path, verbose, ...)

Arguments

file_path

Path to RDS file

verbose

Whether to show messages

...

Additional arguments passed on to the underlying reader.

Value

data.frame


Import TSV files with robust parsing

Description

Import TSV files with robust parsing

Usage

import_tsv(file_path, encoding, verbose, ...)

Arguments

file_path

Path to TSV file

encoding

Character encoding

verbose

Whether to show messages

...

Additional arguments passed on to the underlying reader.

Value

data.frame


Audit dataset for data leakage

Description

This function audits a dataset for potential data leakage, running a series of predefined detectors and generating a comprehensive report with detailed findings.

Usage

leakr_audit(
  data,
  target = NULL,
  split = NULL,
  id = NULL,
  detectors = NULL,
  config = list()
)

Arguments

data

The dataset to be audited (data frame or tibble).

target

The target variable (optional). If NULL, no target variable is assumed.

split

The split variable used for training/test split (optional). If NULL, no split is assumed.

id

The unique identifier for each row (optional). If NULL, no id is used.

detectors

A vector of detector names to run (optional). If NULL, all available detectors will be used.

config

A list of configuration parameters for the audit. Defaults to an empty list.

Value

A leakr_report object containing the audit results, including summary, evidence, and metadata.

Examples


# Basic audit on iris dataset
report <- leakr_audit(iris, target = "Species")
print(report)



Create data snapshots with improved metadata handling

Description

Save data and metadata for reproducible leakage analysis with optimised performance.

Usage

leakr_create_snapshot(
  data,
  output_dir = file.path(tempdir(), "leakr_snapshots"),
  snapshot_name = NULL,
  metadata = list(),
  sample_for_hash = TRUE
)

Arguments

data

Data.frame to snapshot

output_dir

Directory for snapshot files

snapshot_name

Name for this snapshot

metadata

Additional metadata to store

sample_for_hash

Whether to sample large datasets for faster hashing

Value

Path to snapshot directory


Export data in various formats

Description

Save processed data to different file formats with consistent behaviour.

Usage

leakr_export_data(data, file_path, format = "csv", verbose = TRUE, ...)

Arguments

data

Data.frame to export

file_path

Output file path

format

Output format: "csv", "excel", "rds", "json", "parquet"

verbose

Whether to show export messages

...

Additional arguments passed on to the format-specific export routine.

Value

Path to exported file (invisibly)


Convert caret training objects to standard format

Description

Extract data from caret train objects for leakage analysis.

Usage

leakr_from_caret(train_obj, original_data = NULL, target_name = "target")

Arguments

train_obj

caret train object

original_data

Original training data (if available)

target_name

Custom name for target variable (default: "target")

Value

List with data and metadata


Convert mlr3 Task objects to standard format

Description

Extract data from mlr3 Task objects for leakage analysis.

Usage

leakr_from_mlr3(task, include_target = TRUE)

Arguments

task

mlr3 Task object (TaskClassif, TaskRegr, etc.)

include_target

Whether to include target variable in output

Value

List with data, target, and metadata


Convert tidymodels workflow to standard format

Description

Extract data from tidymodels workflows for leakage analysis.

Usage

leakr_from_tidymodels(workflow, data)

Arguments

workflow

tidymodels workflow object

data

Original training data

Value

List with data and metadata


Import data from various sources for leakage analysis

Description

Flexible data import function supporting multiple formats with automatic format detection and preprocessing for leakage analysis.

Usage

leakr_import(
  source,
  format = "auto",
  preprocessing = list(),
  encoding = "UTF-8",
  sheet = NULL,
  verbose = TRUE,
  ...
)

Arguments

source

Path to data file, data.frame, or other supported object.

format

Data format: "auto", "csv", "excel", "rds", "json", "parquet", "tsv". If "auto", the format will be detected from the file extension.

preprocessing

List of preprocessing options to apply after import.

encoding

Character encoding for reading files. Default is "UTF-8".

sheet

Sheet name or index to read (for Excel files). Default is NULL.

verbose

Logical indicating whether to print progress messages. Default TRUE.

...

Additional arguments passed to specific import functions.

Value

A standardised data.frame suitable for leakage analysis.


List available snapshots with enhanced information

Description

Display comprehensive information about available data snapshots.

Usage

leakr_list_snapshots(
  snapshots_dir = file.path(tempdir(), "leakr_snapshots"),
  include_metadata = TRUE
)

Arguments

snapshots_dir

Directory containing snapshots

include_metadata

Whether to load detailed metadata for each snapshot

Value

Data.frame with snapshot information


Load data snapshot with enhanced validation

Description

Restore data from a previously created snapshot with integrity checking.

Usage

leakr_load_snapshot(snapshot_path, format = "rds", verify_integrity = TRUE)

Arguments

snapshot_path

Path to snapshot directory

format

Format to load: "rds" (recommended), "csv"

verify_integrity

Whether to verify data integrity using hash

Value

Data.frame from snapshot
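
The save, hash, and verify cycle behind snapshots can be illustrated with base R alone. This sketch uses tools::md5sum() on the saved file, whereas the package lists digest among its imports; the directory layout, file names, and metadata fields are assumptions for the example.

```r
# Illustrative snapshot round trip; file names and fields are assumptions.
snapshot_dir <- file.path(tempdir(), "snapshot_sketch")
dir.create(snapshot_dir, showWarnings = FALSE, recursive = TRUE)

# Create: save the data and record a hash of the saved file
data_file <- file.path(snapshot_dir, "data.rds")
meta_file <- file.path(snapshot_dir, "metadata.rds")
saveRDS(iris, data_file)
saveRDS(
  list(hash = unname(tools::md5sum(data_file)), n_rows = nrow(iris)),
  meta_file
)

# Load: recompute the hash and compare before trusting the file
meta <- readRDS(meta_file)
current_hash <- unname(tools::md5sum(data_file))
if (!identical(current_hash, meta$hash)) {
  stop("Snapshot failed the integrity check")
}
restored <- readRDS(data_file)
print(dim(restored))
```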


Plot leakage detection results

Description

Plot leakage detection results

Usage

leakr_plot(x, ...)

Arguments

x

Results from leakr_audit

...

Additional arguments passed on to the plotting method.

Value

A ggplot object


Fast import with default preprocessing

Description

Minimal quick import for typical user workflows. Uses leakr_import internally.

Usage

leakr_quick_import(source, ...)

Arguments

source

File path or data.frame

...

Additional arguments passed to leakr_import().

Value

Standardised data.frame


Enhanced summarise with better formatting

Description

This function provides a formatted summary of the leakage audit report. It displays a summary of the leakage issues, including the severity and top issues detected. Optionally, it can also display configuration details used for the audit.

Usage

leakr_summarise(
  report,
  top_n = 10,
  show_config = FALSE,
  config = NULL,
  audit_data = NULL,
  detectors = NULL,
  libname = NULL,
  pkgname = NULL
)

Arguments

report

A leakr_report object from leakr_audit().

top_n

Maximum number of issues to display in the summary. Defaults to 10.

show_config

Whether to display the configuration details used for the audit. Defaults to FALSE.

config

(Optional) A configuration list. This argument is not used directly in the function, but is referenced in the report metadata. Defaults to NULL.

audit_data

(Optional) The data used for auditing. This argument is not used directly in the function, but is part of the report metadata. Defaults to NULL.

detectors

(Optional) A vector of detectors used for the audit. This argument is not used directly in the function but is part of the report metadata. Defaults to NULL.

libname

(Optional) The name of the library. This is included for internal package functionality.

pkgname

(Optional) The name of the package. This is included for internal package functionality.

Value

An invisible data.frame summarizing the top n issues detected.

Examples


# Create and summarise a report
report <- leakr_audit(iris, target = "Species")
leakr_summarise(report, top_n = 5)



List Registered Detectors

Description

Returns the names of all detectors currently registered in the system. This is useful for checking which detectors are available.

Usage

list_registered_detectors()

Value

A character vector containing the names of all registered detectors.

Examples

list_registered_detectors()


Create a new temporal detector

Description

Create a new temporal detector

Usage

new_temporal_detector(time_col, lookahead_window = 1)

Arguments

time_col

Character. Name of the time column

lookahead_window

Numeric. Lookahead window size (default 1).

Value

A temporal_detector object
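
To make the idea of a temporal check concrete, the sketch below flags test rows whose timestamp falls before the end of the training period plus a gap. It is an assumption-laden illustration: the helper name temporal_leak_sketch, the "train"/"test" labels, and the reading of lookahead_window as a gap in days are not confirmed by the package documentation.

```r
# Hypothetical check; treats lookahead_window as a gap in days (assumption).
temporal_leak_sketch <- function(data, time_col, split, lookahead_window = 1) {
  times <- as.numeric(as.Date(data[[time_col]]))
  train_end <- max(times[split == "train"])
  # Test rows should start at least lookahead_window days after training ends
  too_early <- split == "test" & times < train_end + lookahead_window
  which(too_early)
}

df <- data.frame(date = as.Date("2024-01-01") + c(0, 1, 2, 2, 5))
split <- c("train", "train", "train", "test", "test")
flagged_rows <- temporal_leak_sketch(df, "date", split)
print(flagged_rows)
```

Here row 4 shares a date with the last training row, so it is the only row flagged.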


Create a new train-test detector

Description

Create a new train-test detector

Usage

new_train_test_detector(threshold = 0.1)

Arguments

threshold

Numeric. Detection threshold (default 0.1).

Value

A train_test_detector object
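
A train/test contamination check can be sketched as the share of test rows that also appear in the training set, compared against a threshold. Everything here is illustrative: train_test_overlap_sketch is a hypothetical helper, and interpreting threshold as a maximum tolerated overlap fraction is an assumption, not documented behaviour.

```r
# Hypothetical check; threshold is read as a maximum overlap fraction (assumption).
train_test_overlap_sketch <- function(data, split, threshold = 0.1) {
  # Build one key per row by pasting all column values together
  keys <- do.call(paste, c(unname(as.list(data)), sep = "\r"))
  train_keys <- unique(keys[split == "train"])
  test_keys <- keys[split == "test"]
  overlap <- if (length(test_keys) > 0) mean(test_keys %in% train_keys) else 0
  list(overlap = overlap, flagged = overlap > threshold)
}

df <- data.frame(x = c(1, 2, 3, 1, 5), y = c("a", "b", "c", "a", "e"))
split <- c("train", "train", "train", "test", "test")
check <- train_test_overlap_sketch(df, split)
print(check)
```

Pasting values into string keys is simple but compares formatted values; a hash of each row would serve the same purpose on wide data.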


Plot a detector_result object

Description

Plot a detector_result object

Usage

## S3 method for class 'detector_result'
plot(x, palette = NULL, ...)

Arguments

x

A detector_result object.

palette

Optional. A colour palette used for the plot. Defaults to NULL.

...

Additional arguments passed to or from other methods.

Value

A ggplot object, invisibly. Printed if interactive


Plot a udld_report object

Description

This function generates a bar plot of leakage issues detected by different detectors. The plot displays the count of issues by severity level for each detector in a udld_report object.

Usage

## S3 method for class 'udld_report'
plot(x, palette = NULL, ...)

Arguments

x

A udld_report object. This object contains the detectors and their associated issues.

palette

Optional. A ggplot2 discrete palette for coloring the bars based on severity.

...

Additional arguments passed to ggplot2 functions or other methods. These are typically used for customizing the plot further.

Value

A ggplot object, invisibly. The plot is printed if the session is interactive.


Enhanced data preparation with robust preprocessing

Description

This function performs robust data preprocessing and prepares the data for leakage detection. It handles intelligent sampling, adjusts for the presence of a target variable, and structures the data for further audit and analysis.

Usage

prepare_audit_data(data, target, split, id, config)

Arguments

data

A data frame containing the dataset to be audited.

target

The name of the target variable (optional). Used for stratified sampling if provided.

split

A vector or a column name specifying the data split (e.g., training/test split).

id

The unique identifier column for the dataset (optional).

config

A list of configuration settings, including sample size and other audit parameters.

Value

A list of class audit_data containing the preprocessed data along with associated metadata.

Examples

## Not run: 
audit_data <- prepare_audit_data(data, target = "target_column",
                                 split = "train_test_split",
                                 id = "id_column",
                                 config = list(sample_size = 50000))

## End(Not run)


Enhanced preprocessing with better performance and robustness

Description

A preprocessing function to handle common data issues, such as removing empty rows/columns, handling dates, and converting character columns to factors. This function improves data quality before further analysis.

Usage

preprocess_imported_data(data, preprocessing = list(), verbose = FALSE)

Arguments

data

Input data.frame to be preprocessed.

preprocessing

A list of preprocessing options, such as removing empty rows or handling dates.

verbose

Logical flag indicating whether to show progress messages (default is FALSE).

Value

A preprocessed data.frame.
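
Two of the steps mentioned above, dropping empty rows and columns and converting character columns to factors, can be sketched in base R as follows. The helper name preprocess_sketch is hypothetical, and "empty" is taken to mean all-NA, which is an assumption.

```r
# Hypothetical helper; "empty" means all values are NA (assumption).
preprocess_sketch <- function(data) {
  is_na <- is.na(data)  # logical matrix, one cell per value
  data <- data[rowSums(!is_na) > 0, colSums(!is_na) > 0, drop = FALSE]
  # Convert remaining character columns to factors
  is_chr <- vapply(data, is.character, logical(1))
  if (any(is_chr)) {
    data[is_chr] <- lapply(data[is_chr], factor)
  }
  rownames(data) <- NULL
  data
}

raw <- data.frame(
  id = c(1, 2, NA),
  grp = c("a", "b", NA),
  empty = NA,
  stringsAsFactors = FALSE
)
clean <- preprocess_sketch(raw)
str(clean)
```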


Print method for leakr_report

Description

Print method for leakr_report

Usage

## S3 method for class 'leakr_report'
print(x, ...)

Arguments

x

leakr_report object

...

Further arguments passed to or from other methods.


Register a new detector

Description

Register a new data leakage detector function

Usage

register_detector(name, fun, description = "")

Arguments

name

Name of the detector

fun

The detector function to register.

description

Optional text description of the detector. Defaults to "".

Value

Invisibly returns registration status


Run a detector on data

Description

Run a detector on data

Usage

run_detector(detector, data, split = NULL, id = NULL, config = list())

Arguments

detector

A detector object

data

Data frame to analyze

split

Split vector indicating train/test assignment (optional)

id

Optional ID column name

config

Optional configuration list

Value

A detector result object


Run multiple detectors on audit data

Description

This function runs multiple leakage detectors on the provided audit data and returns the results for each detector.

Usage

run_detectors(detectors, audit_data, config)

Arguments

detectors

A list of detector configurations. Each detector can be either a function or an object that contains a func field with the detector function.

audit_data

A data.frame, tibble, or data.table to audit.

config

A list of configuration settings to be passed to each detector.

Value

A list where each element contains the results of running a detector. If a detector fails, an error message is included in the result.

Examples

## Not run: 
detectors <- list(
  temporal = list(func = temporal_detector_func),
  train_test = new_train_test_detector()
)
results <- run_detectors(detectors, audit_data = iris, config = list(sample_size = 50000))

## End(Not run)
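
The behaviour described for failing detectors, where an error message is stored instead of aborting the whole run, can be sketched with tryCatch(). The helper run_detectors_sketch, the (data, config) detector signature, and the shape of the error entry are assumptions for illustration.

```r
# Hypothetical runner; the detector signature (data, config) is an assumption.
run_detectors_sketch <- function(detectors, audit_data, config = list()) {
  lapply(detectors, function(det) {
    # Accept either a bare function or a list with a func field
    fun <- if (is.function(det)) det else det$func
    tryCatch(
      fun(audit_data, config),
      error = function(e) list(error = conditionMessage(e))
    )
  })
}

detectors <- list(
  n_duplicates = function(data, config) sum(duplicated(data)),
  always_fails = list(func = function(data, config) stop("column not found"))
)
results <- run_detectors_sketch(detectors, audit_data = data.frame(a = c(1, 1, 2)))
str(results)
```

Because each detector runs inside its own tryCatch(), one failure does not prevent the remaining detectors from reporting.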


Stratified sampling helper

Description

This function performs stratified sampling based on the provided target vector. The sampling is done proportionally to the distribution of values in the target vector.

Usage

stratified_sample(target_vec, n_sample)

Arguments

target_vec

A vector representing the target variable used for stratification. The function will sample from each class (level) proportionally.

n_sample

The total number of samples to draw.

Value

A vector of indices representing the sampled observations.
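
Proportional stratified sampling of row indices can be sketched as below. This is an illustration under assumptions, not the package's code: stratified_sample_sketch is a hypothetical name, each class keeps at least one row, and rounding means the total drawn can differ slightly from n_sample. Missing target values are dropped by split().

```r
# Hypothetical helper; rounding and the one-row minimum are assumptions.
stratified_sample_sketch <- function(target_vec, n_sample) {
  idx_by_class <- split(seq_along(target_vec), target_vec)
  frac <- n_sample / length(target_vec)
  picked <- lapply(idx_by_class, function(idx) {
    # Sample each class in proportion to its size, at least one row per class
    k <- min(length(idx), max(1L, round(length(idx) * frac)))
    idx[sample.int(length(idx), k)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
target <- rep(c("a", "b", "c"), times = c(60, 30, 10))
idx <- stratified_sample_sketch(target, 20)
print(table(target[idx]))
```

Indexing with sample.int() avoids the surprise that sample(x, k) treats a single number x as the range 1:x.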


Robust data validation and preprocessing

Description

This function performs data validation and preprocessing for audit purposes. It checks the validity of the input data, ensures that the target and ID columns exist, and handles empty or problematic columns.

Usage

validate_and_preprocess_data(data, target, split, id)

Arguments

data

A data frame, tibble, or data table to be validated and preprocessed.

target

The name of the target column, which should be present in the data. If NULL, no target validation is performed.

split

A vector specifying the split column, which will be checked in the data. If NULL, no split validation is performed.

id

The name of the ID column, which should be present in the data. If NULL, no ID validation is performed.

Value

The validated and preprocessed data.

Examples

## Not run: 
# Example data
data <- data.frame(target = rnorm(100), id = 1:100)
target <- "target"
id <- "id"
validated_data <- validate_and_preprocess_data(data, target, NULL, id)

## End(Not run)


Enhanced data validation with better error messages

Description

Enhanced data validation with better error messages

Usage

validate_imported_data(data, source)

Arguments

data

Input data.frame

source

Source identifier for error messages

Value

TRUE (invisibly) if validation passes