Timebased Cross Validation¶

timebasedcv is a Python codebase that provides a cross validation strategy based on time.

Documentation | Repository | Issue Tracker

Disclaimer ⚠️¶

This codebase is experimental and works for my use cases. It is very likely that some cases are not entirely covered and could break (badly). If you find one, please feel free to open an issue on the issue page of the repo.

Description ✨¶

The current implementation of scikit-learn TimeSeriesSplit lacks the flexibility of having multiple samples within the same time period (or time unit).

timebasedcv addresses this problem by providing a cross validation strategy based on a time period rather than on the number of samples. This is useful when the data is time dependent and the split should keep together all samples that fall within the same time window.

Temporal data leakage is an issue, and we want to prevent it by providing splits that keep the past and the future well separated, so that leakage does not spoil a model's cross validation.

Again, these split points depend solely on the time period and not on the number of observations.
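To make the idea concrete, here is a minimal sketch using only the standard library. It does not use timebasedcv itself: the toy data and the `time_window_mask` helper are hypothetical and only illustrate how a split defined by time windows keeps all samples from the same day on the same side, however many there are.

```python
from datetime import date, timedelta

# Hypothetical toy data: several observations may share the same day.
dates = [
    date(2023, 1, 1), date(2023, 1, 1), date(2023, 1, 1),
    date(2023, 1, 2),
    date(2023, 1, 3), date(2023, 1, 3),
    date(2023, 1, 4),
    date(2023, 1, 5), date(2023, 1, 5),
]
values = list(range(len(dates)))


def time_window_mask(dates, start, end):
    """Boolean mask selecting the samples with start <= date < end."""
    return [start <= d < end for d in dates]


# Train on 3 days, then test on the following 2 days.
train_start = date(2023, 1, 1)
train_end = train_start + timedelta(days=3)
test_end = train_end + timedelta(days=2)

train_mask = time_window_mask(dates, train_start, train_end)
test_mask = time_window_mask(dates, train_end, test_end)

# Slice the data with the masks; the window bounds, not the sample counts,
# decide where the split falls.
train_values = [v for v, keep in zip(values, train_mask) if keep]
test_values = [v for v, keep in zip(values, test_mask) if keep]

print(train_values)  # [0, 1, 2, 3, 4, 5]
print(test_values)   # [6, 7, 8]
```

The train window covers three days but six samples, while the test window covers two days and three samples. A splitter based on sample counts could not guarantee that the three observations from the first day end up together.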

Features 📜¶

We introduce two main classes:

TimeBasedSplit lets you define a split based on time unit (frequency), train size, test size, gap, stride, window type and mode.
Warning

TimeBasedSplit is not compatible with scikit-learn CV Splitters.

In fact, we have made the (opinionated) choice to:
- Return the sliced arrays from .split(...), whereas scikit-learn CV Splitters return the train and test indices of each split.
- Require the time series to be passed as input to the .split(...) method, whereas scikit-learn CV Splitters accept only X, y and groups. This time series is used to generate the boolean masks with which we slice the original arrays into train and test sets for each split.

Given these choices, we also provide a scikit-learn compatible splitter: TimeBasedCVSplitter. Because of the signature that scikit-learn requires for .split(...), and because CV Splitters need to know the number of splits a priori, TimeBasedCVSplitter is initialized with the time series containing the time information used to generate the train and test indices of each split.
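The difference between the two designs can be sketched without either library. The class below is a hypothetical toy, not the implementation of TimeBasedCVSplitter: its name, its parameters (`train_days`, `test_days`, `gap_days`, `stride_days`) and its rolling-window logic are assumptions made for illustration. It shows why receiving the time series at construction time lets a splitter expose the index-based `split(X, y, groups)` and `get_n_splits(...)` methods that scikit-learn expects.

```python
from datetime import date, timedelta


class ToyTimeIndexSplitter:
    """Hypothetical index-based splitter; NOT the timebasedcv implementation."""

    def __init__(self, time_series, train_days, test_days, gap_days=0, stride_days=None):
        self.time_series = list(time_series)
        # Assumption: by default, move forward by one test window per split.
        stride = timedelta(days=stride_days if stride_days is not None else test_days)
        train, test, gap = (timedelta(days=d) for d in (train_days, test_days, gap_days))

        # Precompute (train_indices, test_indices) pairs once, so the number
        # of splits is known before .split(...) is ever called.
        self._splits = []
        start, last = min(self.time_series), max(self.time_series)
        while start + train + gap <= last:
            train_end = start + train
            test_start = train_end + gap
            test_end = test_start + test
            train_idx = [i for i, t in enumerate(self.time_series) if start <= t < train_end]
            test_idx = [i for i, t in enumerate(self.time_series) if test_start <= t < test_end]
            if train_idx and test_idx:
                self._splits.append((train_idx, test_idx))
            start += stride

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self._splits)

    def split(self, X=None, y=None, groups=None):
        # X, y and groups are ignored: the indices come from the time series.
        yield from self._splits


# Eight consecutive days with two samples per day (16 samples in total).
time_series = [date(2023, 1, 1) + timedelta(days=i // 2) for i in range(16)]
splitter = ToyTimeIndexSplitter(time_series, train_days=3, test_days=2)

print(splitter.get_n_splits())
for train_idx, test_idx in splitter.split():
    print(train_idx, test_idx)
```

Because the splits are computed from the time series alone, `X`, `y` and `groups` play no role in this sketch. They appear in the signatures only to mirror the scikit-learn splitter interface.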

Dataframe and array agnostic¶

Thanks to Narwhals, TimeBasedSplit works out of the box with pandas, polars, pyarrow and any other dataframe library supported by Narwhals.
Thanks to the array API, TimeBasedSplit works out of the box with numpy, cupy, dask.array and any other array library that supports slicing à la numpy.

Installation 💻¶

TL;DR:

python -m pip install timebasedcv

For further information, please refer to the dedicated installation section.

Getting Started 🏃¶

Please refer to the dedicated getting started section.

Contributing ✌️¶

Please refer to the dedicated contributing guidelines section.

License 👀¶

The project is released under the MIT License.