When working with large eyerisdb databases containing millions of eye-tracking data points, traditional export methods can run into memory limitations or create unwieldy files. The chunked database export functionality in eyeris provides an out-of-the-box solution for handling very large eyerisdb databases: it processes data in memory-efficient chunks, splits output into multiple files when needed, and supports both CSV and Parquet formats for optimal performance.
This vignette walks through how to use these features after you’ve created an eyerisdb database using bidsify(db_enabled = TRUE).
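As a reminder, that setup step might look something like the following hypothetical sketch; the raw file path and any arguments other than db_enabled are placeholders, so see ?bidsify for the full argument list:
library(eyeris)
# preprocess a raw EyeLink .asc file and write the results to BIDS,
# enabling the eyerisdb database backend (path and arguments are illustrative)
eye <- glassbox("/path/to/raw/sub-001.asc")
bidsify(
  eye,
  bids_dir = "/path/to/your/bids/directory",
  db_enabled = TRUE
)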
Before using the chunked export functions, you need:
- An eyerisdb database created with bidsify(db_enabled = TRUE)
- The arrow package installed (for Parquet support): install.packages("arrow") (arrow is included when installing eyeris from CRAN)
The easiest way to export your entire database is with eyeris_db_to_chunked_files():
result <- eyeris_db_to_chunked_files(
bids_dir = "/path/to/your/bids/directory",
db_path = "my-project" # your database name
)
# view what was exported
print(result)
Using the eyeris_db_to_chunked_files() function defaults, this will:
- Process 1 million rows at a time (the default chunk size)
- Create files up to 500MB each (the default maximum file size)
- Export all data types found in your database
- Save files to bids_dir/derivatives/eyerisdb_export/my-project/
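For reference, here is a sketch of the same call with those defaults written out explicitly (the values are as described above; CSV is assumed to be the default format, matching the .csv output shown below):
# equivalent to the basic call above, with the defaults made explicit
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/your/bids/directory",
  db_path = "my-project",
  chunk_size = 1000000,    # 1 million rows per chunk
  max_file_size_mb = 500,  # split output files at ~500MB
  file_format = "csv"      # "parquet" is also supported
)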
The function creates organized output files:
derivatives/eyerisdb_export/my-project/
├── my-project_timeseries_chunked_01.csv              # Single file (< 500MB)
├── my-project_events_chunked_01-of-02.csv            # Multiple files due to size
├── my-project_events_chunked_02-of-02.csv
├── my-project_confounds_summary_goal_chunked_01.csv  # Grouped by schema
├── my-project_confounds_summary_stim_chunked_01.csv  # Different column structure
├── my-project_confounds_events_chunked_01.csv
├── my-project_epoch_summary_chunked_01.csv
└── my-project_epochs_pregoal_chunked_01-of-03.csv    # Epoch-specific data
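You can quickly inspect what was written with base R (the path follows the example above):
# list the exported files
export_dir <- file.path(
  "/path/to/your/bids/directory",
  "derivatives", "eyerisdb_export", "my-project"
)
list.files(export_dir)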
You can customize the maximum file size to create smaller, more manageable files:
# Create smaller files for easy distribution
result <- eyeris_db_to_chunked_files(
bids_dir = "/path/to/bids",
db_path = "large-project",
max_file_size_mb = 100, # 100MB files instead of 500MB
chunk_size = 500000 # Process 500k rows at a time
)
This is particularly useful when:
- Uploading to cloud storage with size/transfer bandwidth limits
- Sharing data via email or file transfer services
- Working with limited storage space
For large databases, you may only need certain types of data:
# Export only pupil timeseries and events
result <- eyeris_db_to_chunked_files(
bids_dir = "/path/to/bids",
db_path = "large-project",
data_types = c("timeseries", "events"),
subjects = c("sub-001", "sub-002", "sub-003") # Specific subjects only
)
Available data types typically include:
- timeseries - Preprocessed eye-tracking pupil data
- events - Experimental events
- epochs - Epoched data around events
- confounds_summary - Confound variables by epoch
- blinks - Detected blinks
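To check exactly which data types are available in your own database, you can list its tables directly (a short sketch using the database helper functions that appear later in this vignette):
# list the tables available in your eyerisdb database
con <- eyeris_db_connect("/path/to/bids", "large-project")
eyeris_db_list_tables(con)
eyeris_db_disconnect(con)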
For better performance and compression, use Parquet format:
result <- eyeris_db_to_chunked_files(
bids_dir = "/path/to/bids",
db_path = "large-project",
file_format = "parquet",
max_file_size_mb = 200
)
Parquet advantages:
- Smaller file sizes (often 50-80% smaller than CSV)
- Faster reading with arrow::read_parquet()
- Better data types (preserves numeric precision)
- Column-oriented storage for analytics
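As a quick sketch of reading the results back (the file name below simply follows the naming pattern shown earlier, and the subject_id column is assumed to be present, as in the database query example later in this vignette):
library(arrow)
library(dplyr)
export_dir <- "path/to/eyerisdb_export/large-project"
# read a single exported Parquet chunk
ts_chunk <- read_parquet(
  file.path(export_dir, "large-project_timeseries_chunked_01.parquet")
)
# or lazily open all timeseries chunks as one dataset and pull only what you need
files <- list.files(
  export_dir,
  pattern = "timeseries_chunked_.*\\.parquet$",
  full.names = TRUE
)
ds <- open_dataset(files)  # lazy: nothing is read into memory yet
sub001 <- ds |>
  filter(subject_id == "sub-001") |>
  collect()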
When files are split due to size limits, you can recombine them:
# Find all parts of a split dataset
files <- list.files(
"path/to/eyerisdb_export/my-project/",
pattern = "timeseries_chunked_.*\\.csv$",
full.names = TRUE
)
# Read and combine all parts
combined_data <- do.call(rbind, lapply(files, read.csv))
# Or, for Parquet exports, use the built-in helper function
combined_data <- read_eyeris_parquet(
parquet_dir = "path/to/eyerisdb_export/my-project/",
data_type = "timeseries"
)
For specialized analysis, you can process chunks with custom functions:
# Connect to database directly
con <- eyeris_db_connect("/path/to/bids", "large-project")
# Define custom analysis function for pupil data
analyze_chunk <- function(chunk) {
# Calculate summary statistics for this chunk
stats <- data.frame(
n_rows = nrow(chunk),
subjects = length(unique(chunk$subject_id)),
mean_eye_x = mean(chunk$eye_x, na.rm = TRUE),
mean_eye_y = mean(chunk$eye_y, na.rm = TRUE),
mean_pupil_raw = mean(chunk$pupil_raw, na.rm = TRUE),
mean_pupil_processed = mean(chunk$pupil_raw_deblink_detransient_interpolate_lpfilt_z, na.rm = TRUE),
missing_pupil_pct = sum(is.na(chunk$pupil_raw)) / nrow(chunk) * 100,
hz_modes = paste(unique(chunk$hz), collapse = ",")
)
# Save chunk summary (write.csv() ignores 'append', so use write.table())
out_file <- "chunk_summaries.csv"
write.table(
  stats, out_file,
  sep = ",", row.names = FALSE,
  col.names = !file.exists(out_file), # write the header only once
  append = file.exists(out_file)
)
return(TRUE) # Indicate success
}
# Hypothetical example: process large timeseries dataset in chunks
result <- process_chunked_query(
con = con,
query = "
SELECT subject_id, session_id, time_secs, eye_x, eye_y,
pupil_raw, pupil_raw_deblink_detransient_interpolate_lpfilt_z, hz
FROM timeseries_01_enc_clamp_run01
WHERE pupil_raw > 0 AND eye_x IS NOT NULL
ORDER BY time_secs
",
chunk_size = 100000,
process_chunk = analyze_chunk
)
eyeris_db_disconnect(con)
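After the run completes, the per-chunk summaries written by analyze_chunk() can be rolled up into overall statistics, for example by weighting each chunk's mean by its row count (a sketch based on the columns defined above):
# combine the per-chunk summaries into overall statistics
chunk_stats <- read.csv("chunk_summaries.csv")
total_rows <- sum(chunk_stats$n_rows)
# approximate overall mean (exact when chunks contain no missing samples)
overall_mean_pupil_raw <- weighted.mean(
  chunk_stats$mean_pupil_raw,
  w = chunk_stats$n_rows
)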
For databases with hundreds of millions of rows:
# Optimize for very large datasets
result <- eyeris_db_to_chunked_files(
bids_dir = "/path/to/bids",
db_path = "massive-project",
chunk_size = 2000000, # 2M rows per chunk for efficiency
max_file_size_mb = 1000, # 1GB files (larger but fewer files)
file_format = "parquet", # Better compression
data_types = "timeseries" # Focus on primary data type for analysis
)
If you encounter out-of-memory errors: the function automatically handles this by processing tables in batches, but if you still run into issues, reduce chunk_size (e.g., to 100,000 rows) or restrict the export to specific data_types or subjects.
If you see “Set operations can only apply to expressions with the same number of result columns”: this usually means tables with different column structures were combined in a single query; the chunked export handles this by grouping tables with matching column schemas into separate output files (as in the confounds_summary_* examples above).
If files are locked or in use: make sure no other R session or program has an open connection to the eyerisdb database file, and close any open connections with eyeris_db_disconnect() before exporting.
For additional help:
- Check the function documentation: ?eyeris_db_to_chunked_files
- Get an overview of your database: eyeris_db_summary(bids_dir, db_path)
- List the available tables: eyeris_db_list_tables(con)
- Re-run the export with verbose = TRUE for more detailed progress output (see the sketch below)
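For example (a sketch; verbose is assumed here to be an argument of eyeris_db_to_chunked_files, as suggested above):
# get a quick overview of what the database contains
eyeris_db_summary("/path/to/bids", "my-project")
# re-run the export with detailed progress output
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "my-project",
  verbose = TRUE
)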
The built-in chunked eyerisdb database export functionality provides a robust solution for working with large eyerisdb databases. Key benefits include:
- Memory-efficient processing of millions of rows in configurable chunks
- Automatic splitting of output into files of a configurable maximum size
- Support for both CSV and Parquet output formats
- Filtering by data type and subject, so you export only what you need
This makes it possible to work with even the largest eye-tracking/pupillometry datasets while maintaining performance and reliability, without sacrificing the ability to share high-quality, reproducible datasets that support collaborative and open research.