`timebasedcv.sklearn`¶

timebasedcv.sklearn.TimeBasedCVSplitter ¶

Bases: _BaseKFold

The TimeBasedCVSplitter is a scikit-learn compatible CV Splitter that generates splits based on time values.

The number of sample in each split is independent of the number of splits but based purely on the timestamp of the sample.

In order to achieve such behaviour we include the arguments of TimeBasedSplit.split() method (namely time_series, start_dt and end_dt) in the constructor (a.k.a. __init__ method) and store the for future use in its split and get_n_splits methods.

In this way we can restrict the arguments of split and get_n_splits to the arrays to split (i.e. X, y and groups), which are the only arguments required by scikit-learn CV Splitters.

Parameters:

Name	Type	Description	Default
`frequency`	`FrequencyUnit`	The frequency (or time unit) of the time series. Must be one of "days", "seconds", "microseconds", "milliseconds", "minutes", "hours", "weeks", "months" or "years". These are the valid values for the `unit` argument of `relativedelta` from python `dateutil` library.	required
`train_size`	`int`	Defines the minimum number of time units required to be in the train set.	required
`forecast_horizon`	`int`	Specifies the number of time units to forecast.	required
`time_series`	`SeriesLike[date] \| SeriesLike[datetime] \| SeriesLike[Timestamp]`	The time series used to create boolean mask for splits. It is not required to be sorted, but it must support: comparison operators (with other date-like objects). bitwise operators (with other boolean arrays). `.min()` and `.max()` methods. `.shape` attribute.	required
`gap`	`int`	Sets the number of time units to skip between the end of the train set and the start of the forecast set.	`0`
`stride`	`int \| None`	How many time unit to move forward after each split. If `None` (or set to 0), the stride is equal to the `forecast_horizon` quantity.	`None`
`window`	`WindowType`	The type of window to use, either "rolling" or "expanding".	`'rolling'`
`mode`	`ModeType`	Determines in which orders the splits are generated, either "forward" (start to end) or "backward" (end to start).	`'forward'`
`start_dt`	`NullableDatetime`	The start of the time period. If provided, it is used in place of the `time_series.min()`.	`None`
`end_dt`	`NullableDatetime`	The end of the time period. If provided,it is used in place of the `time_series.max()`.	`None`

Raises:

Type	Description
`ValueError`	If `frequency` is not one of "days", "seconds", "microseconds", "milliseconds", "minutes", "hours", "weeks". If `window` is not one of "rolling" or "expanding". If `mode` is not one of "forward" or "backward" If `train_size`, `forecast_horizon`, `gap` or `stride` are not strictly positive.
`TypeError`	If `train_size`, `forecast_horizon`, `gap` or `stride` are not of type `int`.

Examples:

import pandas as pd
import numpy as np

from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

from timebasedcv.sklearn import TimeBasedCVSplitter

start_dt = pd.Timestamp(2023, 1, 1)
end_dt = pd.Timestamp(2023, 1, 31)

time_series = pd.Series(pd.date_range(start_dt, end_dt, freq="D"))
size = len(time_series)

df = pd.DataFrame(data=np.random.randn(size, 2), columns=["a", "b"])

X, y = df[["a", "b"]], df[["a", "b"]].sum(axis=1)

cv = TimeBasedCVSplitter(
    frequency="days",
    train_size=7,
    forecast_horizon=11,
    gap=0,
    stride=1,
    window="rolling",
    time_series=time_series,
    start_dt=start_dt,
    end_dt=end_dt,
)

param_grid = {
    "alpha": np.linspace(0.1, 2, 10),
    "fit_intercept": [True, False],
    "positive": [True, False],
}

random_search_cv = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions=param_grid,
    cv=cv,
    n_jobs=-1,
).fit(X, y)

Source code in timebasedcv/sklearn.py

class TimeBasedCVSplitter(_BaseKFold):  # type: ignore[no-any-unimported]
    """The `TimeBasedCVSplitter` is a scikit-learn compatible CV Splitter that generates splits based on time values.

    The number of sample in each split is independent of the number of splits but based purely on the timestamp of the
    sample.

    In order to achieve such behaviour we include the arguments of
    [`TimeBasedSplit.split()`][timebasedcv.core.TimeBasedSplit.split] method (namely `time_series`, `start_dt` and
    `end_dt`) in the constructor (a.k.a. `__init__` method) and store the for future use in its `split` and
    `get_n_splits` methods.

    In this way we can restrict the arguments of `split` and `get_n_splits` to the arrays to split (i.e. `X`, `y` and
    `groups`), which are the only arguments required by scikit-learn CV Splitters.

    Arguments:
        frequency: The frequency (or time unit) of the time series. Must be one of "days", "seconds", "microseconds",
            "milliseconds", "minutes", "hours", "weeks", "months" or "years". These are the valid values for the
            `unit` argument of `relativedelta` from python `dateutil` library.
        train_size: Defines the minimum number of time units required to be in the train set.
        forecast_horizon: Specifies the number of time units to forecast.
        time_series: The time series used to create boolean mask for splits. It is not required to be sorted, but it
            must support:

            - comparison operators (with other date-like objects).
            - bitwise operators (with other boolean arrays).
            - `.min()` and `.max()` methods.
            - `.shape` attribute.
        gap: Sets the number of time units to skip between the end of the train set and the start of the forecast set.
        stride: How many time unit to move forward after each split. If `None` (or set to 0), the stride is equal to the
            `forecast_horizon` quantity.
        window: The type of window to use, either "rolling" or "expanding".
        mode: Determines in which orders the splits are generated, either "forward" (start to end) or "backward"
            (end to start).
        start_dt: The start of the time period. If provided, it is used in place of the `time_series.min()`.
        end_dt: The end of the time period. If provided,it is used in place of the `time_series.max()`.

    Raises:
        ValueError:
            - If `frequency` is not one of "days", "seconds", "microseconds", "milliseconds", "minutes", "hours",
            "weeks".
            - If `window` is not one of "rolling" or "expanding".
            - If `mode` is not one of "forward" or "backward"
            - If `train_size`, `forecast_horizon`, `gap` or `stride` are not strictly positive.
        TypeError: If `train_size`, `forecast_horizon`, `gap` or `stride` are not of type `int`.

    Examples:
        ```python
        import pandas as pd
        import numpy as np

        from sklearn.linear_model import Ridge
        from sklearn.model_selection import RandomizedSearchCV

        from timebasedcv.sklearn import TimeBasedCVSplitter

        start_dt = pd.Timestamp(2023, 1, 1)
        end_dt = pd.Timestamp(2023, 1, 31)

        time_series = pd.Series(pd.date_range(start_dt, end_dt, freq="D"))
        size = len(time_series)

        df = pd.DataFrame(data=np.random.randn(size, 2), columns=["a", "b"])

        X, y = df[["a", "b"]], df[["a", "b"]].sum(axis=1)

        cv = TimeBasedCVSplitter(
            frequency="days",
            train_size=7,
            forecast_horizon=11,
            gap=0,
            stride=1,
            window="rolling",
            time_series=time_series,
            start_dt=start_dt,
            end_dt=end_dt,
        )

        param_grid = {
            "alpha": np.linspace(0.1, 2, 10),
            "fit_intercept": [True, False],
            "positive": [True, False],
        }

        random_search_cv = RandomizedSearchCV(
            estimator=Ridge(),
            param_distributions=param_grid,
            cv=cv,
            n_jobs=-1,
        ).fit(X, y)
        ```
    """

    def __init__(  # noqa: PLR0913
        self: Self,
        *,
        frequency: FrequencyUnit,
        train_size: int,
        forecast_horizon: int,
        time_series: SeriesLike[date] | SeriesLike[datetime] | SeriesLike[pd.Timestamp],
        gap: int = 0,
        stride: int | None = None,
        window: WindowType = "rolling",
        mode: ModeType = "forward",
        start_dt: NullableDatetime = None,
        end_dt: NullableDatetime = None,
    ) -> None:
        self.splitter = TimeBasedSplit(
            frequency=frequency,
            train_size=train_size,
            forecast_horizon=forecast_horizon,
            gap=gap,
            stride=stride,
            window=window,
            mode=mode,
        )

        self.time_series_ = time_series
        self.start_dt_ = start_dt
        self.end_dt_ = end_dt

        self.n_splits = self._compute_n_splits()
        self.size_ = time_series.shape[0]

    def split(
        self: Self,
        X: NDArray | None = None,
        y: NDArray | None = None,
        groups: NDArray | None = None,
    ) -> Generator[tuple[NDArray[np.int_], NDArray[np.int_]], None, None]:
        """Generates integer indices corresponding to train and test sets.

        Arguments:
            X: Optional input features array.
            y: Optional target variable array.
            groups: Optional array containing group labels for the samples.

        Returns:
            A generator that yields tuples of train and test indices.

        Raises:
            ValueError: If the input arrays have incompatible lengths with reference `time_series`.

        """
        self._validate_split_args(self.size_, X, y, groups)

        _indexes = np.arange(self.size_)

        yield from self.splitter.split(  # type: ignore[call-overload]
            _indexes,
            time_series=self.time_series_,
            start_dt=self.start_dt_,
            end_dt=self.end_dt_,
            return_splitstate=False,
        )

    def get_n_splits(
        self: Self,
        X: NDArray | None = None,
        y: NDArray | None = None,
        groups: NDArray | None = None,
    ) -> int:
        """Returns the number of splits that can be generated from the instance.

        Arguments:
            X: Unused, exists for compatibility, checked if not None.
            y: Unused, exists for compatibility, checked if not None.
            groups: Unused, exists for compatibility, checked if not None.

        Returns:
            The number of splits that can be generated from `time_series`.
        """
        self._validate_split_args(self.size_, X, y, groups)
        return self.n_splits

    def _compute_n_splits(self: Self) -> int:
        """Computes number of splits just once in the init."""
        time_start = self.start_dt_ or self.time_series_.min()
        time_end = self.end_dt_ or self.time_series_.max()

        return len(tuple(self.splitter._splits_from_period(time_start, time_end)))  # noqa: SLF001

    @staticmethod
    def _validate_split_args(
        size: int,
        X: NDArray | None = None,
        y: NDArray | None = None,
        groups: NDArray | None = None,
    ) -> None:
        """Validates the arguments passed to the `split` and `get_n_splits` methods."""
        if X is not None and X.shape[0] != size:
            msg = f"Invalid shape: {X.shape[0]=} doesn't match time_series.shape[0]={size}"
            raise ValueError(msg)

        if y is not None and y.shape[0] != size:
            msg = f"Invalid shape: {y.shape[0]=} doesn't match time_series.shape[0]={size}"
            raise ValueError(msg)

        if groups is not None and groups.shape[0] != size:
            msg = f"Invalid shape: {groups.shape[0]=} doesn't match time_series.shape[0]={size}"
            raise ValueError(msg)

split ¶

split(X: NDArray | None = None, y: NDArray | None = None, groups: NDArray | None = None) -> Generator[tuple[NDArray[int_], NDArray[int_]], None, None]

Generates integer indices corresponding to train and test sets.

Parameters:

Name	Type	Description	Default
`X`	`NDArray \| None`	Optional input features array.	`None`
`y`	`NDArray \| None`	Optional target variable array.	`None`
`groups`	`NDArray \| None`	Optional array containing group labels for the samples.	`None`

Returns:

Type	Description
`None`	A generator that yields tuples of train and test indices.

Raises:

Type	Description
`ValueError`	If the input arrays have incompatible lengths with reference `time_series`.

Source code in timebasedcv/sklearn.py

def split(
    self: Self,
    X: NDArray | None = None,
    y: NDArray | None = None,
    groups: NDArray | None = None,
) -> Generator[tuple[NDArray[np.int_], NDArray[np.int_]], None, None]:
    """Generates integer indices corresponding to train and test sets.

    Arguments:
        X: Optional input features array.
        y: Optional target variable array.
        groups: Optional array containing group labels for the samples.

    Returns:
        A generator that yields tuples of train and test indices.

    Raises:
        ValueError: If the input arrays have incompatible lengths with reference `time_series`.

    """
    self._validate_split_args(self.size_, X, y, groups)

    _indexes = np.arange(self.size_)

    yield from self.splitter.split(  # type: ignore[call-overload]
        _indexes,
        time_series=self.time_series_,
        start_dt=self.start_dt_,
        end_dt=self.end_dt_,
        return_splitstate=False,
    )

get_n_splits ¶

get_n_splits(X: NDArray | None = None, y: NDArray | None = None, groups: NDArray | None = None) -> int

Returns the number of splits that can be generated from the instance.

Parameters:

Name	Type	Description	Default
`X`	`NDArray \| None`	Unused, exists for compatibility, checked if not None.	`None`
`y`	`NDArray \| None`	Unused, exists for compatibility, checked if not None.	`None`
`groups`	`NDArray \| None`	Unused, exists for compatibility, checked if not None.	`None`

Returns:

Type	Description
`int`	The number of splits that can be generated from `time_series`.

Source code in timebasedcv/sklearn.py

def get_n_splits(
    self: Self,
    X: NDArray | None = None,
    y: NDArray | None = None,
    groups: NDArray | None = None,
) -> int:
    """Returns the number of splits that can be generated from the instance.

    Arguments:
        X: Unused, exists for compatibility, checked if not None.
        y: Unused, exists for compatibility, checked if not None.
        groups: Unused, exists for compatibility, checked if not None.

    Returns:
        The number of splits that can be generated from `time_series`.
    """
    self._validate_split_args(self.size_, X, y, groups)
    return self.n_splits

timebasedcv.sklearn¶