leaspy.io.data¶

Submodules¶

Attributes¶

DataframeDataReaderFactoryInput

Classes¶

`AbstractDataframeDataReader`	Methods to convert `pandas.DataFrame` to Leaspy-compliant data containers.
`Data`	Main data container for a collection of individuals
`Dataset`	Data container based on `torch.Tensor`, used to run algorithms.
`EventDataframeDataReader`	Methods to convert `pandas.DataFrame` to Leaspy-compliant data containers for event data only.
`DataframeDataReaderNames`	Enumeration defining the possible names for dataframe data readers.
`IndividualData`	Container for an individual's data
`JointDataframeDataReader`	Methods to convert `pandas.DataFrame` to Leaspy-compliant data containers for event data and longitudinal data.
`VisitDataframeDataReader`	Methods to convert `pandas.DataFrame` to Leaspy-compliant data containers for longitudinal data only.

Functions¶

dataframe_data_reader_factory(reader, **kwargs)

Factory for dataframe data readers.

Package Contents¶

class AbstractDataframeDataReader[source]¶

Methods to convert pandas.DataFrame to Leaspy-compliant data containers.

Raises:

LeaspyDataInputError

time_rounding_digits = 6¶

individuals: dict[IDType, IndividualData]¶

iter_to_idx: dict[int, IDType]¶

n_individuals: int = 0¶

read(df, *, drop_full_nan=True, sort_index=False, warn_empty_column=True)[source]¶

The method that effectively reads the input dataframe (automatically called in __init__).

Parameters:

dfpandas.DataFrame: The dataframe to read.
drop_full_nanbool: Should we drop rows full of nans? (except index)
sort_indexbool: Should we lexsort the index? (Kept False by default so as not to break the many downstream tests that check order…)
warn_empty_columnbool: Should we warn when there are empty columns?

Parameters:

df (DataFrame)
drop_full_nan (bool)
sort_index (bool)
warn_empty_column (bool)

Return type:

None
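As a rough illustration of the drop_full_nan and sort_index options, the sketch below applies the same two steps to plain Python rows. The row layout and the helper name preprocess_rows are assumptions made for this example; the actual reader operates on a pandas.DataFrame and may differ in detail.

```python
import math


def preprocess_rows(rows, *, drop_full_nan=True, sort_index=False):
    """rows: list of (id, time, features) tuples (hypothetical layout)."""
    if drop_full_nan:
        # Drop rows whose feature values are all NaN (the index is ignored).
        rows = [r for r in rows if not all(math.isnan(v) for v in r[2])]
    if sort_index:
        # Lexicographic sort on the (ID, TIME) index.
        rows = sorted(rows, key=lambda r: (r[0], r[1]))
    return rows


nan = float("nan")
rows = [
    ("s2", 71.0, [0.4, nan]),
    ("s1", 65.5, [nan, nan]),  # full-NaN row, dropped by default
    ("s1", 64.0, [0.1, 0.2]),
]
kept = preprocess_rows(rows)                      # input order preserved
ordered = preprocess_rows(rows, sort_index=True)  # sorted by (ID, TIME)
```

With sort_index left at False the surviving rows keep their input order, which is the behaviour the default is meant to preserve.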

class Data[source]¶

Bases: collections.abc.Iterable

Main data container for a collection of individuals

It can be iterated over and sliced, both of these operations being applied to the underlying individuals attribute.

Attributes:

individualsDict [IDType , IndividualData]: Included individuals and their associated data
iter_to_idxDict [int, IDType]: Maps an integer index to the associated individual ID
headersList [FeatureType]: Feature names
dimensionint: Number of features
n_individualsint: Number of individuals
n_visitsint: Total number of visits
cofactorsList [FeatureType]: Feature names corresponding to cofactors
event_time_namestr: Name of the header that stores the time at event in the original dataframe
event_bool_namestr: Name of the header that stores the bool at event (censored or observed) in the original dataframe
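The relationship between individuals, iter_to_idx, iteration and slicing can be pictured with a small stand-in class. MiniData is a hypothetical toy written for this example, not the leaspy implementation; it only mirrors the behaviour described above.

```python
from collections.abc import Iterable


class MiniData(Iterable):
    """Toy stand-in for Data (hypothetical, not the leaspy class)."""

    def __init__(self, individuals):
        # individuals: dict mapping ID -> per-individual payload
        self.individuals = dict(individuals)
        # Integer position -> individual ID, in insertion order
        self.iter_to_idx = dict(enumerate(self.individuals))

    @property
    def n_individuals(self):
        return len(self.individuals)

    def __iter__(self):
        # Iteration walks the underlying individuals in order.
        return iter(self.individuals.values())

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.individuals[self.iter_to_idx[key]]
        if isinstance(key, slice):
            ids = [self.iter_to_idx[i] for i in range(self.n_individuals)[key]]
            return MiniData({i: self.individuals[i] for i in ids})
        raise TypeError(f"Unsupported key type: {type(key).__name__}")


data = MiniData({"s1": [64.0, 65.5], "s2": [71.0], "s3": [80.2, 81.0]})
first_two = data[:2]  # a new container holding s1 and s2 only
```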

individuals: dict[IDType, IndividualData]¶

iter_to_idx: dict[int, IDType]¶

headers: list[FeatureType] | None = None¶

event_time_name: str | None = None¶

event_bool_name: str | None = None¶

covariate_names: list[str] | None = None¶

property dimension: int | None¶

Number of features

Returns:

int or None:: Number of features in the dataset. If no features are present, returns None.

Return type:

Optional[int]

property n_individuals: int¶

Number of individuals

Returns:

int:: Number of individuals in the dataset.

Return type:

int

property n_visits: int¶

Total number of visits

Returns:

int:: Total number of visits in the dataset.

Return type:

int

property cofactors: list[FeatureType]¶

Feature names corresponding to cofactors

Returns:

List [FeatureType]:: List of feature names corresponding to cofactors.

Return type:

list[FeatureType]

load_cofactors(df, *, cofactors=None)[source]¶

Load cofactors from a pandas.DataFrame to the Data object

Parameters:

dfpandas.DataFrame: The dataframe where the cofactors are stored. Its index should be ID, the identifier of subjects and it should uniquely index the dataframe (i.e. one row per individual).
cofactorsList [FeatureType], optional: Names of the column(s) of dataframe which shall be loaded as cofactors. If None, all the columns from the input dataframe will be loaded as cofactors. Default: None

Parameters:

df (DataFrame)
cofactors (Optional[list[FeatureType]])

Return type:

None

static from_csv_file(path, data_type='visit', *, pd_read_csv_kws={}, facto_kws={}, **df_reader_kws)[source]¶

Create a Data object from a CSV file.

Parameters:

pathstr: Path to the CSV file to load (with extension)
data_typestr: Type of data to read. Can be ‘visit’ or ‘event’.
pd_read_csv_kwsdict: Keyword arguments that are sent to pandas.read_csv()
facto_kwsdict: Keyword arguments that are sent to dataframe_data_reader_factory()
**df_reader_kws: Keyword arguments that are sent to the AbstractDataframeDataReader returned by dataframe_data_reader_factory()

Returns:

Data:: A Data object containing the data from the CSV file.

Parameters:

path (str)
data_type (str)
pd_read_csv_kws (dict)
facto_kws (dict)

Return type:

Data

to_dataframe(*, cofactors=None, reset_index=True)[source]¶

Convert the Data object to a pandas.DataFrame

Parameters:

cofactorsList [FeatureType] or str, optional: Cofactors to include in the DataFrame. If None (default), no cofactors are included. If “all”, all the available cofactors are included. Default: None
reset_indexbool, optional: Whether to reset index levels in output. Default: True

Returns:

pandas.DataFrame:: A DataFrame containing the individuals’ ID, timepoints and associated observations and, optionally, cofactors.

Raises:

LeaspyDataInputError: If the Data object does not contain any cofactors.
LeaspyTypeError: If the cofactors argument is not of a valid type.

Parameters:

cofactors (Optional[Union[list[FeatureType], str]])
reset_index (bool)

Return type:

DataFrame
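The three accepted forms of the cofactors argument (None, "all", or a list of names) can be sketched as a small resolution step. The helper resolve_cofactors is hypothetical; plain TypeError and ValueError stand in for LeaspyTypeError and LeaspyDataInputError, and the check on unknown names is an assumption rather than documented behaviour.

```python
def resolve_cofactors(requested, available):
    """Return the cofactor names to export (hypothetical helper)."""
    if requested is None:
        return []  # default: no cofactors in the output
    if isinstance(requested, str):
        if requested != "all":
            raise TypeError("The only accepted string is 'all'")
        return list(available)  # every available cofactor
    if isinstance(requested, list):
        # Assumed check: every requested name must be available.
        unknown = [c for c in requested if c not in available]
        if unknown:
            raise ValueError(f"Unknown cofactors: {unknown}")
        return list(requested)
    raise TypeError(f"Invalid type for cofactors: {type(requested).__name__}")


available = ["sex", "apoe"]
none_selected = resolve_cofactors(None, available)
all_selected = resolve_cofactors("all", available)
some_selected = resolve_cofactors(["apoe"], available)
```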

static from_dataframe(df, data_type='visit', factory_kws={}, **kws)[source]¶

Create a Data object from a DataFrame.

Parameters:

dfpandas.DataFrame: Dataframe containing ID, TIME and features.
data_typestr: Type of data to read. Can be ‘visit’, ‘event’, ‘joint’
factory_kwsDict: Keyword arguments that are sent to dataframe_data_reader_factory()
**kws: Keyword arguments that are sent to DataframeDataReader

Returns:

Data

Parameters:

df (DataFrame)
data_type (str)
factory_kws (dict)

Return type:

Data

static from_individual_values(indices, timepoints=None, values=None, headers=None, event_time_name=None, event_bool_name=None, event_time=None, event_bool=None, covariate_names=None, covariates=None)[source]¶

Construct Data from a collection of individual data points

Parameters:

indicesList [IDType]: List of the individuals’ unique ID
timepointsList [List [float]]: For each individual i, list of timepoints associated with the observations. The number of such timepoints is noted n_timepoints_i
valuesList [array-like [float, 2D]]: For each individual i, two-dimensional array-like object containing observed data points. Its expected shape is (n_timepoints_i, n_features)
headersList [FeatureType]: Feature names. The number of features is noted n_features

Returns:

Data:: A Data object containing the individuals and their data.

Parameters:

indices (list[IDType])
timepoints (Optional[list[list[float]]])
values (Optional[list[list[list[float]]]])
headers (Optional[list[FeatureType]])
event_time_name (Optional[str])
event_bool_name (Optional[str])
event_time (Optional[list[list[float]]])
event_bool (Optional[list[list[int]]])
covariate_names (Optional[list[str]])
covariates (Optional[list[list[int]]])

Return type:

Data
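The nested-list layout expected for indices, timepoints, values and headers can be checked with a few lines of plain Python. The helper check_individual_values and the sample feature names are hypothetical and only restate the shapes given in the parameter descriptions above.

```python
def check_individual_values(indices, timepoints, values, headers):
    """Validate the nested-list layout described above (hypothetical helper)."""
    if not (len(indices) == len(timepoints) == len(values)):
        raise ValueError("indices, timepoints and values must have the same length")
    n_features = len(headers)
    for idx, times, obs in zip(indices, timepoints, values):
        # One row of observations per timepoint of individual `idx`.
        if len(obs) != len(times):
            raise ValueError(f"{idx}: expected {len(times)} rows, got {len(obs)}")
        # Each row holds one value per feature.
        if any(len(row) != n_features for row in obs):
            raise ValueError(f"{idx}: each row must have {n_features} values")
    return [(len(t), n_features) for t in timepoints]


indices = ["s1", "s2"]
timepoints = [[64.0, 65.5, 67.0], [71.0]]
values = [
    [[0.1, 0.2], [0.2, 0.3], [0.4, 0.5]],  # shape (3, 2)
    [[0.6, 0.7]],                          # shape (1, 2)
]
headers = ["memory", "language"]
shapes = check_individual_values(indices, timepoints, values, headers)
```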

static from_individuals(individuals, headers=None, event_time_name=None, event_bool_name=None, covariate_names=None)[source]¶

Construct Data from a list of individuals

Parameters:

individualsList [IndividualData]: List of individuals
headersList [FeatureType]: List of feature names

Returns:

Data:: A Data object containing the individuals and their data.

Parameters:

individuals (list[IndividualData])
headers (Optional[list[FeatureType]])
event_time_name (Optional[str])
event_bool_name (Optional[str])
covariate_names (Optional[list[str]])

Return type:

Data

extract_longitudinal_only()[source]¶

Extract longitudinal data from the Data object

Returns:

Data:: A Data object containing only longitudinal data.

Raises:

LeaspyDataInputError: If the Data object does not contain any longitudinal data.

Return type:

Data

class Dataset(data, *, no_warning=False)[source]¶

Data container based on torch.Tensor, used to run algorithms.

Parameters:

dataData: Create Dataset from Data object
no_warningbool, default False: Whether to deactivate warnings that are emitted by methods of this dataset instance. We may want to deactivate them because we rebuild a dataset per individual in scipy minimize. Indeed, all relevant warnings certainly occurred for the overall dataset.

Attributes:

headerslist [str]: Features names
dimensionint: Number of features
n_individualsint: Number of individuals
indiceslist [IDType]: Order of patients
event_timetorch.FloatTensor: Time of an event; if the event is censored, the time corresponds to the last patient observation
event_booltorch.BoolTensor: Boolean indicating whether an event is observed (1) or censored (0)
n_visits_per_individuallist [int]: Number of visits per individual
n_visits_maxint: Maximum number of visits for one individual
n_visitsint: Total number of visits
n_observations_per_ind_per_fttorch.LongTensor, shape (n_individuals, dimension): Number of observations (not taking into account missing values) per individual per feature
n_observations_per_fttorch.LongTensor, shape (dimension,): Total number of observations per feature
n_observationsint: Total number of observations
timepointstorch.FloatTensor, shape (n_individuals, n_visits_max): Ages of patients at their different visits
valuestorch.FloatTensor, shape (n_individuals, n_visits_max, dimension): Values of patients for each visit for each feature
masktorch.FloatTensor, shape (n_individuals, n_visits_max, dimension): Binary mask associated to values. If 1, the value is meaningful. If 0, the value is meaningless (either it was nan or it does not correspond to a real visit and is only here for padding)
L2_norm_per_fttorch.FloatTensor, shape (dimension,): Sum of all non-nan squared values, feature per feature
L2_normscalar torch.FloatTensor: Sum of all non-nan squared values
no_warningbool, default False: Whether to deactivate warnings that are emitted by methods of this dataset instance. We may want to deactivate them because we rebuild a dataset per individual in scipy minimize. Indeed, all relevant warnings certainly occurred for the overall dataset.
_one_hot_encodingdict [bool, torch.LongTensor]: Values of patients for each visit for each feature, but tensorized into a one-hot encoding (pdf or sf) Shapes of tensors are (n_individuals, n_visits_max, dimension, max_ordinal_level [-1 when sf=True])

Raises:

LeaspyInputError: If data, model or algo are not compatible with each other.

Parameters:

data (Data)
no_warning (bool)
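To make the padded layout concrete, the following pure-Python sketch builds values, mask and a few of the derived counts from ragged per-individual visits. The helper pad_and_mask is hypothetical, nested lists stand in for tensors, and storing NaNs and padding as 0 is an assumption of this example.

```python
import math


def pad_and_mask(values_per_individual, dimension):
    """Pad ragged per-individual visits (hypothetical helper, pure Python)."""
    n_visits_per_individual = [len(v) for v in values_per_individual]
    n_visits_max = max(n_visits_per_individual)
    values, mask = [], []
    for visits in values_per_individual:
        padded_vals, padded_mask = [], []
        for j in range(n_visits_max):
            row = visits[j] if j < len(visits) else [math.nan] * dimension
            # mask is 1 for a real, non-NaN value and 0 otherwise
            padded_mask.append([0.0 if math.isnan(x) else 1.0 for x in row])
            # assumption: NaNs and padding are stored as 0 so sums ignore them
            padded_vals.append([0.0 if math.isnan(x) else x for x in row])
        values.append(padded_vals)
        mask.append(padded_mask)
    return values, mask, n_visits_per_individual, n_visits_max


raw = [
    [[0.5, math.nan], [1.0, 2.0]],  # individual 0: 2 visits, one missing value
    [[2.0, 1.0]],                   # individual 1: 1 visit
]
values, mask, n_visits_per_individual, n_visits_max = pad_and_mask(raw, dimension=2)

# Observation counts and squared norms, feature per feature
n_observations_per_ft = [
    int(sum(mask[i][j][k] for i in range(len(raw)) for j in range(n_visits_max)))
    for k in range(2)
]
L2_norm_per_ft = [
    sum(values[i][j][k] ** 2 for i in range(len(raw)) for j in range(n_visits_max))
    for k in range(2)
]
n_visits = sum(n_visits_per_individual)
n_observations = sum(n_observations_per_ft)
```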

n_individuals¶

indices¶

headers: list[FeatureType]¶

dimension: int¶

n_visits: int¶

timepoints: torch.FloatTensor | None = None¶

values: torch.FloatTensor | None = None¶

mask: torch.FloatTensor | None = None¶

n_observations: int | None = None¶

n_observations_per_ft: torch.LongTensor | None = None¶

n_observations_per_ind_per_ft: torch.LongTensor | None = None¶

n_visits_per_individual: list[int] | None = None¶

n_visits_max: int | None = None¶

event_time_name: str | None¶

event_bool_name: str | None¶

event_time: torch.FloatTensor | None = None¶

event_bool: torch.IntTensor | None = None¶

covariate_names: list[str] | None¶

covariates: torch.IntTensor | None = None¶

L2_norm_per_ft: torch.FloatTensor | None = None¶

L2_norm: torch.FloatTensor | None = None¶

no_warning = False¶

get_times_patient(i)[source]¶

Get ages for patient number i

Parameters:

iint: The index of the patient (<!> not its identifier)

Returns:

torch.Tensor, shape (n_obs_of_patient,): Contains float

Parameters:

i (int)

Return type:

torch.FloatTensor
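Because the argument is a positional index rather than an identifier, a lookup by ID needs one extra step. The sketch below uses plain lists in place of tensors; times_of_patient is a hypothetical helper, and the padding value 0.0 is an assumption of this example.

```python
def times_of_patient(timepoints, n_visits_per_individual, i):
    """Return the real (unpadded) visit times of the i-th patient."""
    return timepoints[i][: n_visits_per_individual[i]]


indices = ["s1", "s2"]                    # order of patients
timepoints = [[64.0, 65.5], [71.0, 0.0]]  # padded to n_visits_max = 2
n_visits_per_individual = [2, 1]

i = indices.index("s2")  # convert an identifier to a positional index first
times = times_of_patient(timepoints, n_visits_per_individual, i)
```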

get_event_patient(idx_patient)[source]¶

Get ages at event for patient number idx_patient

Parameters:

idx_patientint: The index of the patient (<!> not its identifier)

Returns:

tuple [torch.Tensor, torch.Tensor] , shape (n_obs_of_patient,): Contains float

Parameters:

idx_patient (int)

Return type:

tuple[Tensor, Tensor]

get_covariates_patient(idx_patient)[source]¶

Get covariates for patient number idx_patient

Parameters:

idx_patientint: The index of the patient (<!> not its identifier)

Returns:

torch.Tensor, shape (n_obs_of_patient,): Contains float

Raises:

ValueError: If the dataset has no covariates.

Parameters:

idx_patient (int)

Return type:

torch.IntTensor

get_values_patient(i, *, adapt_for_model=None)[source]¶

Get values for patient number i, with nans.

Parameters:

iint

The index of the patient (<!> not its identifier)

adapt_for_modelNone (default) or McmcSaemCompatibleModel

The values returned are suited for this model. In particular:

For model with noise_model=’ordinal’ will return one-hot-encoded values [P(X = l), l=0..ordinal_max_level]

For model with noise_model=’ordinal_ranking’ will return survival function values [P(X > l), l=0..ordinal_max_level-1]

If None, we return the raw values, whatever the model is.

Returns:

torch.Tensor, shape (n_obs_of_patient, dimension [, extra_dimension_for_ordinal_models]): Contains float or nans

Parameters:

i (int)

Return type:

torch.FloatTensor

to_pandas(apply_headers=False)[source]¶

Convert dataset to a DataFrame with [‘ID’, ‘TIME’] index, with all covariates, events and repeated measures if apply_headers is False, and only the repeated measures otherwise.

Parameters:

apply_headersbool: Set to True to keep only the columns that are needed for a leaspy fit (headers attribute)

Returns:

pandas.DataFrame: DataFrame with index [‘ID’, ‘TIME’] and columns corresponding to the features, events and covariates.

Raises:

LeaspyInputError: If the index of the DataFrame is not unique or contains invalid values.

Parameters:

apply_headers (bool)

Return type:

DataFrame

move_to_device(device)[source]¶

Moves the dataset to the specified device.

Parameters:

devicetorch.device

Parameters:

device (device)

Return type:

None

get_one_hot_encoding(*, sf, ordinal_infos)[source]¶

Builds the one-hot encoding of ordinal data once and for all and returns it.

Parameters:

sfbool: Whether the vector should be the survival function [1(X > l), l=0..max_level-1] instead of the probability density function [1(X=l), l=0..max_level]
ordinal_infosKwargsType: All the hyperparameters concerning ordinal modelling (in particular maximum level per features)

Returns:

torch.LongTensor: One-hot encoding of data values.

Raises:

LeaspyInputError: If the values are not non-negative integers or if the features in ordinal_infos are not consistent with the dataset headers.

Parameters:

sf (bool)
ordinal_infos (KwargsType)

Return type:

torch.LongTensor
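The two encodings selected by sf can be written out for a single ordinal value. The helper one_hot below is a hypothetical pure-Python sketch of the formulas given above, not the tensorized implementation.

```python
def one_hot(level, max_level, *, sf):
    """Encode one ordinal level (hypothetical helper, pure Python)."""
    if not isinstance(level, int) or level < 0:
        raise ValueError("Ordinal values must be non-negative integers")
    if sf:
        # Survival function: 1(X > l) for l = 0..max_level-1
        return [1 if level > l else 0 for l in range(max_level)]
    # Probability density function: 1(X = l) for l = 0..max_level
    return [1 if level == l else 0 for l in range(max_level + 1)]


pdf = one_hot(2, 3, sf=False)  # one entry per level 0..3
surv = one_hot(2, 3, sf=True)  # one entry per threshold 0..2
```

The survival-function vector is one element shorter than the density vector, which matches the "-1 when sf=True" remark on the last dimension of _one_hot_encoding.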

class EventDataframeDataReader(*, event_time_name='EVENT_TIME', event_bool_name='EVENT_BOOL', nb_events=None)[source]¶

Bases: leaspy.io.data.abstract_dataframe_data_reader.AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for event data only.

Parameters:

event_time_name: str: Name of the column in the dataframe that contains the time of event
event_bool_name: str: Name of the column in the dataframe that indicates whether the event is censored or not

Raises:

LeaspyDataInputError

Parameters:

event_time_name (str)
event_bool_name (str)
nb_events (Optional[int])

event_time_name = 'EVENT_TIME'¶

event_bool_name = 'EVENT_BOOL'¶

nb_events = None¶

DataframeDataReaderFactoryInput¶

class DataframeDataReaderNames(*args, **kwds)[source]¶

Bases: enum.Enum

Enumeration defining the possible names for dataframe data readers.

EVENT = 'event'¶

VISIT = 'visit'¶

JOINT = 'joint'¶

COVARIATE = 'covariate'¶

classmethod from_string(reader_name)[source]¶

Returns the enum member corresponding to the given string.

Parameters:

reader_namestr: The name of the reader, case-insensitive.

Returns:

DataframeDataReaderNames: The corresponding enum member.

Raises:

NotImplementedError: If the provided reader_name does not match any of the enum members. The error message lists the valid names.

Parameters:

reader_name (str)
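A case-insensitive lookup of this kind can be sketched with the standard enum module. ReaderNames is a toy copy of the members listed above, written for this example; the wording of the error message is an assumption.

```python
from enum import Enum


class ReaderNames(Enum):
    """Toy copy of the members listed above (hypothetical name)."""

    EVENT = "event"
    VISIT = "visit"
    JOINT = "joint"
    COVARIATE = "covariate"

    @classmethod
    def from_string(cls, reader_name):
        try:
            # Normalising the case makes the lookup case-insensitive.
            return cls(reader_name.lower())
        except ValueError:
            valid = [member.value for member in cls]
            raise NotImplementedError(
                f"Reader '{reader_name}' is not implemented. Valid names: {valid}"
            ) from None


member = ReaderNames.from_string("Visit")
```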

dataframe_data_reader_factory(reader, **kwargs)[source]¶

Factory for dataframe data readers.

Parameters:

readerstr or AbstractDataframeDataReader or dict [ str, …]

If an AbstractDataframeDataReader instance, returns the instance.
If a string, then returns a new instance of the appropriate class (with optional parameters kwargs).
If a dictionary, it must contain the ‘name’ key and other initialization parameters.

**kwargs

Optional parameters for initializing the requested data reader when a string.

Returns:

AbstractDataframeDataReader: The desired data reader.

Raises:

LeaspyModelInputError: If reader is not supported.

Parameters:

reader (DataframeDataReaderFactoryInput)

Return type:

AbstractDataframeDataReader
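The dispatch rules described for the factory input (instance, string, or dictionary with a 'name' key) can be sketched as follows. All class and function names here are hypothetical stand-ins, and ValueError replaces LeaspyModelInputError; the real factory may handle these cases differently.

```python
class BaseReader:
    """Hypothetical stand-in for AbstractDataframeDataReader."""

    def __init__(self, **kwargs):
        self.options = kwargs


class VisitReader(BaseReader): ...


class EventReader(BaseReader): ...


READERS = {"visit": VisitReader, "event": EventReader}


def reader_factory(reader, **kwargs):
    if isinstance(reader, BaseReader):
        return reader  # already an instance: returned unchanged
    if isinstance(reader, dict):
        # Dictionary input: 'name' selects the class, other keys initialise it
        params = dict(reader)
        return reader_factory(params.pop("name"), **params)
    if isinstance(reader, str):
        cls = READERS.get(reader.lower())
        if cls is not None:
            return cls(**kwargs)
    raise ValueError(f"Unsupported reader: {reader!r}")


existing = VisitReader()
same = reader_factory(existing)
from_name = reader_factory("event", event_time_name="T")
from_dict = reader_factory({"name": "event", "event_time_name": "T"})
```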

class IndividualData(idx)[source]¶

Container for an individual’s data

Parameters:

idxIDType: Unique ID

Attributes:

idxIDType: Unique ID
timepointsnp.ndarray [float]: Timepoints associated with the observations (1D array)
observationsnp.ndarray [float]: Observed data points; shape is (n_timepoints, n_features)
cofactorsdict [FeatureType, Any]: Cofactors in the form {cofactor_name: cofactor_value}
event_timefloat: Time of an event; if the event is censored, the time corresponds to the last patient observation
event_boolbool: Boolean indicating whether an event is observed (1) or censored (0)

Parameters:

idx (IDType)

idx: IDType¶

timepoints: ndarray = None¶

observations: ndarray = None¶

event_time: ndarray | None = None¶

event_bool: ndarray | None = None¶

cofactors: dict[FeatureType, Any]¶

covariates: ndarray | None = None¶

add_observations(timepoints, observations)[source]¶

Include new observations and associated timepoints

Parameters:

timepointsarray-like [float]: Timepoints associated with the observations to include, 1D array
observationsarray-like [float]: Observations to include, 2D array

Raises:

LeaspyDataInputError

Parameters:

timepoints (list[float])
observations (list[list[float]])

Return type:

None
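One plausible way to merge new observations into an individual's record is to insert each row at its sorted position and refuse duplicates. MiniIndividual is a hypothetical stand-in using plain lists; rejecting an existing timepoint is an assumption about when an error is raised, and ValueError replaces LeaspyDataInputError.

```python
from bisect import bisect


class MiniIndividual:
    """Hypothetical stand-in for IndividualData, using plain lists."""

    def __init__(self, idx):
        self.idx = idx
        self.timepoints = []
        self.observations = []

    def add_observations(self, timepoints, observations):
        if len(timepoints) != len(observations):
            raise ValueError("One row of observations is needed per timepoint")
        for t, obs in zip(timepoints, observations):
            if t in self.timepoints:
                # Assumed behaviour: an existing timepoint is never overwritten.
                raise ValueError(f"Timepoint {t} already exists for {self.idx}")
            pos = bisect(self.timepoints, t)  # keep timepoints sorted
            self.timepoints.insert(pos, t)
            self.observations.insert(pos, list(obs))


ind = MiniIndividual("s1")
ind.add_observations([65.5, 64.0], [[0.2, 0.3], [0.1, 0.2]])
ind.add_observations([64.8], [[0.15, 0.25]])
```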

add_event(event_time, event_bool)[source]¶

Include event time and associated censoring bool

Parameters:

event_timefloat: Time of the event
event_boolfloat: 0 if censored (not observed) and 1 if observed

Parameters:

event_time (list[float])
event_bool (list[bool])

Return type:

None

add_covariates(covariates)[source]¶

Include covariates

Parameters:

covariatesarray-like [float]: Covariates to include, 2D array

Parameters:

covariates (list[list[int]])

Return type:

None

add_cofactors(cofactors)[source]¶

Include new cofactors

Parameters:

cofactorsdict [FeatureType, Any]: Cofactors to include, in the form {name: value}

Raises:

LeaspyDataInputError
LeaspyTypeError

Parameters:

cofactors (dict[FeatureType, Any])

Return type:

None

to_frame(headers, event_time_name, event_bool_name, covariate_names)[source]¶

Convert the individual data to a pandas DataFrame

Parameters:

headerslist [str]: List of feature names for the observations
event_time_namestr: Name of the column for the event time
event_bool_namestr: Name of the column for the event boolean (0 or 1)
covariate_nameslist [str]: List of covariate names

Returns:

pd.DataFrame

DataFrame containing the individual’s data with the following columns:

ID: Unique identifier for the individual
TIME: Timepoints associated with the observations
Observations: Observed data points for each feature
Event Time: Time of the event (if any)
Event Boolean: Boolean indicating if the event was observed (1) or censored (0)
Covariates: Values of the covariates for the individual

Parameters:

headers (list)
event_time_name (str)
event_bool_name (str)
covariate_names (list[str])

Return type:

DataFrame

class JointDataframeDataReader(*, event_time_name='EVENT_TIME', event_bool_name='EVENT_BOOL', nb_events=None)[source]¶

Bases: leaspy.io.data.abstract_dataframe_data_reader.AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for event data and longitudinal data.

Parameters:

event_time_name: str: Name of the column in the dataframe that contains the time of event
event_bool_name: str: Name of the column in the dataframe that indicates whether the event is censored or not

Raises:

LeaspyDataInputError

Parameters:

event_time_name (str)
event_bool_name (str)
nb_events (Optional[int])

tol_diff = 0.001¶

visit_reader¶

event_reader¶

property event_time_name: str¶

Name of the event time column in dataset

Return type:: str

property event_bool_name: str¶

Name of the event bool column in dataset

Return type:: str

property dimension: int | None¶

Number of longitudinal outcomes in dataset.

Return type:: Optional[int]

property long_outcome_names: list[FeatureType]¶

Name of the longitudinal outcomes in dataset

Return type:: list[FeatureType]

property n_visits: int¶

Number of visits in the dataset

Return type:: int

class VisitDataframeDataReader[source]¶

Bases: leaspy.io.data.abstract_dataframe_data_reader.AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for longitudinal data only.

Raises:

LeaspyDataInputError

property dimension: int | None¶

Number of longitudinal outcomes in dataset.

Returns:

int:: Number of longitudinal outcomes in dataset.

Return type:

Optional[int]