leaspy.io.data

Submodules

Attributes

Classes

AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers.

Data

Main data container for a collection of individuals

Dataset

Data container based on torch.Tensor, used to run algorithms.

EventDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for event data only.

DataframeDataReaderNames

Enumeration defining the possible names for dataframe data readers.

IndividualData

Container for an individual's data

JointDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for event data and longitudinal data.

VisitDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for longitudinal data only.

Functions

dataframe_data_reader_factory(reader, **kwargs)

Factory for dataframe data readers.

Package Contents

class AbstractDataframeDataReader[source]

Methods to convert pandas.DataFrame to Leaspy-compliant data containers.

Raises:
LeaspyDataInputError
time_rounding_digits = 6
individuals: dict[IDType, IndividualData]
iter_to_idx: dict[int, IDType]
n_individuals: int = 0
read(df, *, drop_full_nan=True, sort_index=False, warn_empty_column=True)[source]

The method that actually reads the input dataframe (automatically called in __init__).

Parameters:
df : pandas.DataFrame

The dataframe to read.

drop_full_nan : bool

Whether to drop rows that are full of NaNs (index excluded).

sort_index : bool

Whether to lexsort the index. (False is kept as the default so as not to break the many downstream tests that check order…)

warn_empty_column : bool

Whether to warn when there are empty columns.

Return type:

None
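The effect of drop_full_nan and sort_index can be pictured on plain rows. The sketch below uses only the standard library and is not leaspy's implementation; the helper name clean_rows and the row layout (an (ID, TIME) index mapped to a list of feature values) are hypothetical.

```python
import math


def clean_rows(rows, *, drop_full_nan=True, sort_index=False):
    """Hypothetical stand-in for the row filtering described above.

    `rows` maps an (ID, TIME) index to a list of feature values.
    """
    items = list(rows.items())
    if drop_full_nan:
        # Drop rows whose values are all NaN (the index itself is ignored).
        items = [
            (idx, vals) for idx, vals in items
            if not all(math.isnan(v) for v in vals)
        ]
    if sort_index:
        # Lexicographic sort on (ID, TIME).
        items.sort(key=lambda item: item[0])
    return dict(items)


nan = float("nan")
rows = {
    ("B", 71.0): [0.4, nan],
    ("A", 66.5): [nan, nan],  # fully NaN: dropped by default
    ("A", 65.0): [0.1, 0.2],
}
kept = clean_rows(rows)
sorted_rows = clean_rows(rows, sort_index=True)
```

With sort_index left to False, the surviving rows keep their input order, which is the behaviour the default is meant to preserve.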

class Data[source]

Bases: collections.abc.Iterable

Main data container for a collection of individuals

It can be iterated over and sliced, both of these operations being applied to the underlying individuals attribute.

Attributes:
individuals : Dict[IDType, IndividualData]

Included individuals and their associated data

iter_to_idx : Dict[int, IDType]

Maps an integer index to the associated individual ID

headers : List[FeatureType]

Feature names

dimension : int

Number of features

n_individuals : int

Number of individuals

n_visits : int

Total number of visits

cofactors : List[FeatureType]

Feature names corresponding to cofactors

event_time_name : str

Name of the header that stores the time at event in the original dataframe

event_bool_name : str

Name of the header that stores the event indicator (censored or observed) in the original dataframe

individuals: dict[IDType, IndividualData]
iter_to_idx: dict[int, IDType]
headers: list[FeatureType] | None = None
event_time_name: str | None = None
event_bool_name: str | None = None
covariate_names: list[str] | None = None
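To illustrate how iteration and slicing can both be routed through the individuals and iter_to_idx attributes, here is a minimal sketch. MiniData is a hypothetical, simplified stand-in written for this page, not the actual Data class, and the per-individual payloads are plain lists.

```python
class MiniData:
    """Hypothetical, simplified stand-in for the container described above."""

    def __init__(self, individuals):
        # individuals: dict mapping an ID to that individual's data
        self.individuals = dict(individuals)
        # iter_to_idx: position in iteration order -> individual ID
        self.iter_to_idx = dict(enumerate(self.individuals))

    @property
    def n_individuals(self):
        return len(self.individuals)

    def __iter__(self):
        # Iterating yields the per-individual payloads, in insertion order.
        return iter(self.individuals.values())

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.individuals[self.iter_to_idx[key]]
        if isinstance(key, slice):
            ids = [self.iter_to_idx[i] for i in range(self.n_individuals)[key]]
            return MiniData({i: self.individuals[i] for i in ids})
        raise TypeError(f"unsupported key type: {type(key).__name__}")


data = MiniData({"A": [65.0, 66.5], "B": [71.0], "C": [80.2, 81.0]})
subset = data[1:]
visits_per_individual = [len(payload) for payload in data]
```

Slicing returns a new container whose iter_to_idx is rebuilt from zero, so integer positions stay contiguous in the subset.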
property dimension: int | None

Number of features

Returns:
int or None:

Number of features in the dataset. If no features are present, returns None.

Return type:

Optional[int]

property n_individuals: int

Number of individuals

Returns:
int:

Number of individuals in the dataset.

Return type:

int

property n_visits: int

Total number of visits

Returns:
int:

Total number of visits in the dataset.

Return type:

int

property cofactors: list[FeatureType]

Feature names corresponding to cofactors

Returns:
List [FeatureType]:

List of feature names corresponding to cofactors.

Return type:

list[FeatureType]

load_cofactors(df, *, cofactors=None)[source]

Load cofactors from a pandas.DataFrame to the Data object

Parameters:
df : pandas.DataFrame

The dataframe where the cofactors are stored. Its index should be ID, the identifier of subjects, and it should uniquely index the dataframe (i.e. one row per individual).

cofactors : List[FeatureType], optional

Names of the column(s) of the dataframe to load as cofactors. If None, all the columns of the input dataframe are loaded as cofactors. Default: None

Return type:

None
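The one-row-per-individual requirement and the cofactors selection can be sketched as follows. This is an illustration with plain dictionaries, not the library's code; attach_cofactors and its arguments are hypothetical names, and skipping unknown subjects is an assumption of the sketch.

```python
def attach_cofactors(individuals, table, cofactors=None):
    """Hypothetical helper: copy per-subject cofactors into `individuals`.

    `individuals` maps ID -> dict of cofactors; `table` is a list of
    (ID, {column: value}) rows standing in for the dataframe.
    """
    ids = [subject_id for subject_id, _ in table]
    if len(ids) != len(set(ids)):
        raise ValueError("the ID index must be unique (one row per individual)")
    for subject_id, row in table:
        if subject_id not in individuals:
            continue  # assumption: unknown subjects are skipped in this sketch
        # None means "load every column as a cofactor".
        names = list(row) if cofactors is None else cofactors
        individuals[subject_id].update({name: row[name] for name in names})


individuals = {"A": {}, "B": {}}
table = [("A", {"sex": "F", "site": 1}), ("B", {"sex": "M", "site": 2})]
attach_cofactors(individuals, table, cofactors=["sex"])
```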

static from_csv_file(path, data_type='visit', *, pd_read_csv_kws={}, facto_kws={}, **df_reader_kws)[source]

Create a Data object from a CSV file.

Parameters:
path : str

Path to the CSV file to load (with extension)

data_type : str

Type of data to read. Can be ‘visit’ or ‘event’.

pd_read_csv_kws : dict

Keyword arguments that are sent to pandas.read_csv()

facto_kws : dict

Keyword arguments that are sent to dataframe_data_reader_factory()

**df_reader_kws

Keyword arguments that are sent to the AbstractDataframeDataReader obtained from dataframe_data_reader_factory()

Returns:
Data:

A Data object containing the data from the CSV file.

Return type:

Data

to_dataframe(*, cofactors=None, reset_index=True)[source]

Convert the Data object to a pandas.DataFrame

Parameters:
cofactors : List[FeatureType] or str, optional

Cofactors to include in the DataFrame. If None (default), no cofactors are included. If “all”, all the available cofactors are included. Default: None

reset_index : bool, optional

Whether to reset index levels in output. Default: True

Returns:
pandas.DataFrame:

A DataFrame containing the individuals’ IDs, timepoints and associated observations (and, optionally, cofactors).

Raises:
LeaspyDataInputError

If the Data object does not contain any cofactors.

LeaspyTypeError

If the cofactors argument is not of a valid type.

Return type:

DataFrame

static from_dataframe(df, data_type='visit', factory_kws={}, **kws)[source]

Create a Data object from a DataFrame.

Parameters:
df : pandas.DataFrame

Dataframe containing ID, TIME and features.

data_type : str

Type of data to read. Can be ‘visit’, ‘event’ or ‘joint’.

factory_kws : Dict

Keyword arguments that are sent to dataframe_data_reader_factory()

**kws

Keyword arguments that are sent to DataframeDataReader

Returns:
Data

Return type:

Data

static from_individual_values(indices, timepoints=None, values=None, headers=None, event_time_name=None, event_bool_name=None, event_time=None, event_bool=None, covariate_names=None, covariates=None)[source]

Construct Data from a collection of individual data points

Parameters:
indices : List[IDType]

List of the individuals’ unique IDs

timepoints : List[List[float]]

For each individual i, list of timepoints associated with the observations. The number of such timepoints is noted n_timepoints_i

values : List[array-like[float, 2D]]

For each individual i, two-dimensional array-like object containing observed data points. Its expected shape is (n_timepoints_i, n_features)

headers : List[FeatureType]

Feature names. The number of features is noted n_features

Returns:
Data:

A Data object containing the individuals and their data.

Return type:

Data
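The expected layout of indices, timepoints and values is easiest to see by deriving them from long-format rows (one row per individual and visit). The helper below is a hypothetical sketch using only the standard library; the feature names memory and language are made up for the example.

```python
from collections import defaultdict


def group_long_format(rows, headers):
    """Hypothetical helper turning long-format rows into per-individual lists.

    Each row is a dict with 'ID', 'TIME' and one key per feature in `headers`.
    """
    per_id = defaultdict(list)
    for row in rows:
        per_id[row["ID"]].append(row)
    indices, timepoints, values = [], [], []
    for subject_id, subject_rows in per_id.items():
        subject_rows.sort(key=lambda r: r["TIME"])
        indices.append(subject_id)
        timepoints.append([r["TIME"] for r in subject_rows])
        # Shape (n_timepoints_i, n_features) for individual i.
        values.append([[r[h] for h in headers] for r in subject_rows])
    return indices, timepoints, values


rows = [
    {"ID": "A", "TIME": 66.5, "memory": 0.3, "language": 0.5},
    {"ID": "A", "TIME": 65.0, "memory": 0.1, "language": 0.2},
    {"ID": "B", "TIME": 71.0, "memory": 0.4, "language": 0.6},
]
indices, timepoints, values = group_long_format(rows, ["memory", "language"])
```

The three lists are aligned by position: the i-th entries of timepoints and values belong to the individual at indices[i].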

static from_individuals(individuals, headers=None, event_time_name=None, event_bool_name=None, covariate_names=None)[source]

Construct Data from a list of individuals

Parameters:
individuals : List[IndividualData]

List of individuals

headers : List[FeatureType]

List of feature names

Returns:
Data:

A Data object containing the individuals and their data.

Return type:

Data

extract_longitudinal_only()[source]

Extract longitudinal data from the Data object

Returns:
Data:

A Data object containing only longitudinal data.

Raises:
LeaspyDataInputError

If the Data object does not contain any longitudinal data.

Return type:

Data

class Dataset(data, *, no_warning=False)[source]

Data container based on torch.Tensor, used to run algorithms.

Parameters:
data : Data

The Data object from which the Dataset is created

no_warning : bool, default False

Whether to deactivate the warnings emitted by methods of this dataset instance. Deactivating them can be useful because a dataset is rebuilt per individual in scipy minimize, and all relevant warnings have most likely already been emitted for the overall dataset.

Attributes:
headers : list[str]

Feature names

dimension : int

Number of features

n_individuals : int

Number of individuals

indices : list[IDType]

Order of patients

event_time : torch.FloatTensor

Time of an event. If the event is censored, the time corresponds to the last patient observation

event_bool : torch.BoolTensor

Boolean indicating whether an event is censored or not: 1 observed, 0 censored

n_visits_per_individual : list[int]

Number of visits per individual

n_visits_max : int

Maximum number of visits for one individual

n_visits : int

Total number of visits

n_observations_per_ind_per_ft : torch.LongTensor, shape (n_individuals, dimension)

Number of observations (not taking into account missing values) per individual per feature

n_observations_per_ft : torch.LongTensor, shape (dimension,)

Total number of observations per feature

n_observations : int

Total number of observations

timepoints : torch.FloatTensor, shape (n_individuals, n_visits_max)

Ages of patients at their different visits

values : torch.FloatTensor, shape (n_individuals, n_visits_max, dimension)

Values of patients for each visit for each feature

mask : torch.FloatTensor, shape (n_individuals, n_visits_max, dimension)

Binary mask associated with values. If 1, the value is meaningful. If 0, the value is meaningless (it was either NaN or does not correspond to a real visit and is only there for padding)

L2_norm_per_ft : torch.FloatTensor, shape (dimension,)

Sum of all non-NaN squared values, feature per feature

L2_norm : scalar torch.FloatTensor

Sum of all non-NaN squared values

no_warning : bool, default False

Whether to deactivate the warnings emitted by methods of this dataset instance. Deactivating them can be useful because a dataset is rebuilt per individual in scipy minimize, and all relevant warnings have most likely already been emitted for the overall dataset.

_one_hot_encoding : dict[bool, torch.LongTensor]

Values of patients for each visit for each feature, tensorized into a one-hot encoding (pdf or sf). Shapes of tensors are (n_individuals, n_visits_max, dimension, max_ordinal_level [-1 when sf=True])

Raises:
LeaspyInputError

If data, model or algo are not compatible with one another.

n_individuals
indices
headers: list[FeatureType]
dimension: int
n_visits: int
timepoints: torch.FloatTensor | None = None
values: torch.FloatTensor | None = None
mask: torch.FloatTensor | None = None
n_observations: int | None = None
n_observations_per_ft: torch.LongTensor | None = None
n_observations_per_ind_per_ft: torch.LongTensor | None = None
n_visits_per_individual: list[int] | None = None
n_visits_max: int | None = None
event_time_name: str | None
event_bool_name: str | None
event_time: torch.FloatTensor | None = None
event_bool: torch.IntTensor | None = None
covariate_names: list[str] | None
covariates: torch.IntTensor | None = None
L2_norm_per_ft: torch.FloatTensor | None = None
L2_norm: torch.FloatTensor | None = None
no_warning = False
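The relationship between the padded values, the mask and the derived counts can be sketched with nested lists instead of tensors. This is a simplified illustration, not the Dataset implementation: pad_and_mask is a hypothetical helper, and storing 0.0 at masked positions is an assumption of the sketch.

```python
import math


def pad_and_mask(values_per_individual, dimension):
    """Hypothetical helper: pad ragged visits and build the binary mask."""
    n_visits_per_individual = [len(v) for v in values_per_individual]
    n_visits_max = max(n_visits_per_individual)
    values, mask = [], []
    for visits in values_per_individual:
        n_missing = n_visits_max - len(visits)
        padded = [list(v) for v in visits] + [[math.nan] * dimension for _ in range(n_missing)]
        # 1.0 where a real, non-NaN value exists; 0.0 for NaN or padding.
        mask.append([[0.0 if math.isnan(x) else 1.0 for x in v] for v in padded])
        # Masked entries are stored as 0.0 here (an assumption of this sketch).
        values.append([[0.0 if math.isnan(x) else x for x in v] for v in padded])
    n_obs_per_ft = [
        int(sum(visit[j] for ind in mask for visit in ind)) for j in range(dimension)
    ]
    l2_per_ft = [
        sum(visit[j] ** 2 for ind in values for visit in ind) for j in range(dimension)
    ]
    return values, mask, n_visits_max, n_obs_per_ft, l2_per_ft


nan = math.nan
raw = [
    [[1.0, 2.0], [3.0, nan]],  # individual 0: two visits, one missing value
    [[2.0, 2.0]],              # individual 1: one visit, padded to two
]
values, mask, n_visits_max, n_obs_per_ft, l2_per_ft = pad_and_mask(raw, dimension=2)
```

Because masked entries contribute zero, the per-feature sums of squares only involve real, non-NaN observations.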
get_times_patient(i)[source]

Get ages for patient number i

Parameters:
i : int

The index of the patient (warning: not the patient's identifier)

Returns:
torch.Tensor, shape (n_obs_of_patient,)

Contains float

Parameters:

i (int)

Return type:

torch.FloatTensor

get_event_patient(idx_patient)[source]

Get ages at event for patient number idx_patient

Parameters:
idx_patient : int

The index of the patient (warning: not the patient's identifier)

Returns:
tuple[torch.Tensor, torch.Tensor], shape (n_obs_of_patient,)

Contains float

Parameters:

idx_patient (int)

Return type:

tuple[Tensor, Tensor]

get_covariates_patient(idx_patient)[source]

Get covariates for patient number idx_patient

Parameters:
idx_patient : int

The index of the patient (warning: not the patient's identifier)

Returns:
torch.Tensor, shape (n_obs_of_patient,)

Contains float

Raises:
ValueError

If the dataset has no covariates.

Parameters:

idx_patient (int)

Return type:

torch.IntTensor

get_values_patient(i, *, adapt_for_model=None)[source]

Get values for patient number i, with nans.

Parameters:
i : int

The index of the patient (warning: not the patient's identifier)

adapt_for_model : None (default) or McmcSaemCompatibleModel

The values returned are suited for this model. In particular:

  • For a model with noise_model=‘ordinal’, returns one-hot-encoded values [P(X = l), l=0..ordinal_max_level]

  • For a model with noise_model=‘ordinal_ranking’, returns survival function values [P(X > l), l=0..ordinal_max_level-1]

If None, the raw values are returned, whatever the model is.

Returns:
torch.Tensor, shape (n_obs_of_patient, dimension [, extra_dimension_for_ordinal_models])

Contains float or nans

Parameters:

i (int)

Return type:

torch.FloatTensor
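One way to picture "values for patient i, with nans" is to cut the padded arrays to the patient's real visits and put NaN back wherever the mask is 0. The sketch below does this with nested lists; values_with_nans is a hypothetical helper and only an assumption about how such a lookup could be written, not the method's actual code.

```python
import math


def values_with_nans(values, mask, n_visits_per_individual, i):
    """Hypothetical helper: unpadded values of patient i, NaN where masked."""
    n_visits = n_visits_per_individual[i]
    return [
        [x if m == 1.0 else math.nan for x, m in zip(value_row, mask_row)]
        for value_row, mask_row in zip(values[i][:n_visits], mask[i][:n_visits])
    ]


# Padded values and mask for two patients (2 visits max, 2 features).
values = [[[1.0, 2.0], [3.0, 0.0]], [[2.0, 2.0], [0.0, 0.0]]]
mask = [[[1.0, 1.0], [1.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]]]
first = values_with_nans(values, mask, [2, 1], 0)
second = values_with_nans(values, mask, [2, 1], 1)
```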

to_pandas(apply_headers=False)[source]

Convert dataset to a DataFrame with [‘ID’, ‘TIME’] index, with all covariates, events and repeated measures if apply_headers is False, and only the repeated measures otherwise.

Parameters:
apply_headers : bool

If True, select only the columns needed for a leaspy fit (headers attribute)

Returns:
pandas.DataFrame

DataFrame with index [‘ID’, ‘TIME’] and columns corresponding to the features, events and covariates.

Raises:
LeaspyInputError

If the index of the DataFrame is not unique or contains invalid values.

Parameters:

apply_headers (bool)

Return type:

DataFrame

move_to_device(device)[source]

Moves the dataset to the specified device.

Parameters:
device : torch.device
Parameters:

device (device)

Return type:

None

get_one_hot_encoding(*, sf, ordinal_infos)[source]

Builds the one-hot encoding of ordinal data once and for all and returns it.

Parameters:
sf : bool

Whether the vector should be the survival function [1(X > l), l=0..max_level-1] instead of the probability density function [1(X=l), l=0..max_level]

ordinal_infos : KwargsType

All the hyperparameters concerning ordinal modelling (in particular the maximum level per feature)

Returns:
torch.LongTensor

One-hot encoding of data values.

Raises:
LeaspyInputError

If the values are not non-negative integers or if the features in ordinal_infos are not consistent with the dataset headers.

Return type:

torch.LongTensor
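The two encodings controlled by sf differ only in the indicator used and in their length. The following sketch encodes a single ordinal value with plain lists; one_hot is a hypothetical function written for illustration, whereas the method above works on whole tensors.

```python
def one_hot(value, max_level, *, sf):
    """Encode one ordinal value as indicators, per the two conventions above."""
    if not (isinstance(value, int) and 0 <= value <= max_level):
        raise ValueError("expected an integer level in [0, max_level]")
    if sf:
        # Survival function: 1(X > l) for l = 0..max_level-1
        return [int(value > level) for level in range(max_level)]
    # Probability density function: 1(X = l) for l = 0..max_level
    return [int(value == level) for level in range(max_level + 1)]


pdf = one_hot(2, max_level=3, sf=False)   # [0, 0, 1, 0]
surv = one_hot(2, max_level=3, sf=True)   # [1, 1, 0]
```

The survival-function vector has one entry fewer than the pdf vector, which matches the "-1 when sf=True" note on the last tensor dimension.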

class EventDataframeDataReader(*, event_time_name='EVENT_TIME', event_bool_name='EVENT_BOOL', nb_events=None)[source]

Bases: leaspy.io.data.abstract_dataframe_data_reader.AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for event data only.

Parameters:
event_time_name: str

Name of the column in the dataframe that contains the time of the event

event_bool_name: str

Name of the column in the dataframe that indicates whether the event is censored or not

Raises:
LeaspyDataInputError
Parameters:
  • event_time_name (str)

  • event_bool_name (str)

  • nb_events (Optional[int])

event_time_name = 'EVENT_TIME'
event_bool_name = 'EVENT_BOOL'
nb_events = None
DataframeDataReaderFactoryInput
class DataframeDataReaderNames(*args, **kwds)[source]

Bases: enum.Enum

Enumeration defining the possible names for dataframe data readers.

EVENT = 'event'
VISIT = 'visit'
JOINT = 'joint'
COVARIATE = 'covariate'
classmethod from_string(reader_name)[source]

Returns the enum member corresponding to the given string.

Parameters:
reader_name : str

The name of the reader, case-insensitive.

Returns:
DataframeDataReaderNames

The corresponding enum member.

Raises:
NotImplementedError

If the provided reader_name does not match any of the enum members. The error message lists the valid names.

Parameters:

reader_name (str)
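A case-insensitive lookup of this kind can be written with the standard enum module. The class below is a hypothetical stand-in that mirrors the members listed above; the exact wording of the error message is an assumption.

```python
from enum import Enum


class ReaderNames(Enum):
    """Hypothetical stand-in mirroring the members listed above."""

    EVENT = "event"
    VISIT = "visit"
    JOINT = "joint"
    COVARIATE = "covariate"

    @classmethod
    def from_string(cls, reader_name):
        try:
            # Enum lookup by value, after lower-casing the input.
            return cls(reader_name.lower())
        except ValueError:
            valid = [member.value for member in cls]
            raise NotImplementedError(
                f"Reader '{reader_name}' is not implemented. Valid names: {valid}"
            ) from None


member = ReaderNames.from_string("Visit")
```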

dataframe_data_reader_factory(reader, **kwargs)[source]

Factory for dataframe data readers.

Parameters:
reader : str, AbstractDataframeDataReader or dict [ str, …]
  • If an AbstractDataframeDataReader instance, returns that instance.

  • If a string, returns a new instance of the appropriate class (with optional parameters kwargs).

  • If a dictionary, it must contain the ‘name’ key and other initialization parameters.

**kwargs

Optional parameters for initializing the requested reader when a string is given.

Returns:
AbstractDataframeDataReader

The desired data reader.

Raises:
LeaspyModelInputError

If the requested reader is not supported.

Parameters:

reader (DataframeDataReaderFactoryInput)

Return type:

AbstractDataframeDataReader
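The three input forms accepted by the factory (instance, string, dictionary with a ‘name’ key) can be sketched as a small dispatch function. Everything here is hypothetical: Reader, VisitReader, EventReader and reader_factory are stand-ins defined for the example, and the exception types differ from the library's.

```python
class Reader:
    """Hypothetical base class standing in for AbstractDataframeDataReader."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class VisitReader(Reader):
    pass


class EventReader(Reader):
    pass


_READERS = {"visit": VisitReader, "event": EventReader}


def reader_factory(reader, **kwargs):
    if isinstance(reader, Reader):
        # Already an instance: return it unchanged.
        return reader
    if isinstance(reader, dict):
        params = dict(reader)
        if "name" not in params:
            raise ValueError("a dictionary input must contain the 'name' key")
        # Remaining entries are used as initialization parameters.
        return reader_factory(params.pop("name"), **params)
    if isinstance(reader, str):
        reader_class = _READERS.get(reader.lower())
        if reader_class is None:
            raise ValueError(f"unsupported reader '{reader}'")
        return reader_class(**kwargs)
    raise TypeError(f"unsupported input type: {type(reader).__name__}")


existing = VisitReader()
same = reader_factory(existing)
from_name = reader_factory("event", event_time_name="T_EVENT")
from_dict = reader_factory({"name": "visit"})
```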

class IndividualData(idx)[source]

Container for an individual’s data

Parameters:
idx : IDType

Unique ID

Attributes:
idx : IDType

Unique ID

timepoints : np.ndarray[float]

Timepoints associated with the observations, 1D array

observations : np.ndarray[float]

Observed data points. Shape is (n_timepoints, n_features)

cofactors : dict[FeatureType, Any]

Cofactors in the form {cofactor_name: cofactor_value}

event_time : float

Time of an event. If the event is censored, the time corresponds to the last patient observation

event_bool : bool

Boolean indicating whether an event is censored or not: 1 observed, 0 censored

Parameters:

idx (IDType)

idx: IDType
timepoints: ndarray = None
observations: ndarray = None
event_time: ndarray | None = None
event_bool: ndarray | None = None
cofactors: dict[FeatureType, Any]
covariates: ndarray | None = None
add_observations(timepoints, observations)[source]

Include new observations and associated timepoints

Parameters:
timepoints : array-like[float]

Timepoints associated with the observations to include, 1D array

observations : array-like[float]

Observations to include, 2D array

Raises:
LeaspyDataInputError

Return type:

None
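As an illustration of adding observations together with their timepoints, the sketch below keeps both lists aligned and sorted by time. MiniIndividual is a hypothetical stand-in; rejecting duplicated timepoints and keeping the lists sorted are assumptions of this sketch, not documented behaviour.

```python
import bisect


class MiniIndividual:
    """Hypothetical stand-in keeping timepoints sorted as observations arrive."""

    def __init__(self, idx):
        self.idx = idx
        self.timepoints = []
        self.observations = []

    def add_observations(self, timepoints, observations):
        if len(timepoints) != len(observations):
            raise ValueError("one observation row is expected per timepoint")
        for t, obs in zip(timepoints, observations):
            if t in self.timepoints:
                # Assumption: duplicated timepoints are rejected.
                raise ValueError(f"timepoint {t} already exists for {self.idx}")
            # Insert at the position that keeps timepoints sorted.
            pos = bisect.bisect(self.timepoints, t)
            self.timepoints.insert(pos, t)
            self.observations.insert(pos, list(obs))


individual = MiniIndividual("A")
individual.add_observations([66.5, 65.0], [[0.3], [0.1]])
individual.add_observations([65.7], [[0.2]])
```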

add_event(event_time, event_bool)[source]

Include event time and associated censoring bool

Parameters:
event_time : float

Time of the event

event_bool : float

0 if censored (not observed) and 1 if observed

Return type:

None

add_covariates(covariates)[source]

Include covariates

Parameters:
covariates : array-like[float]

Covariates to include, 2D array

Parameters:

covariates (list[list[int]])

Return type:

None

add_cofactors(cofactors)[source]

Include new cofactors

Parameters:
cofactors : dict[FeatureType, Any]

Cofactors to include, in the form {name: value}

Raises:
LeaspyDataInputError
LeaspyTypeError
Parameters:

cofactors (dict[FeatureType, Any])

Return type:

None

to_frame(headers, event_time_name, event_bool_name, covariate_names)[source]

Convert the individual data to a pandas DataFrame

Parameters:
headers : list[str]

List of feature names for the observations

event_time_name : str

Name of the column for the event time

event_bool_name : str

Name of the column for the event boolean (0 or 1)

covariate_names : list[str]

List of covariate names

Returns:
pd.DataFrame

DataFrame containing the individual’s data with the following columns:

  • ID: Unique identifier for the individual

  • TIME: Timepoints associated with the observations

  • Observations: Observed data points for each feature

  • Event Time: Time of the event (if any)

  • Event Boolean: Boolean indicating if the event was observed (1) or censored (0)

  • Covariates: Values of the covariates for the individual

Parameters:
  • headers (list)

  • event_time_name (str)

  • event_bool_name (str)

  • covariate_names (list[str])

Return type:

DataFrame

class JointDataframeDataReader(*, event_time_name='EVENT_TIME', event_bool_name='EVENT_BOOL', nb_events=None)[source]

Bases: leaspy.io.data.abstract_dataframe_data_reader.AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for event data and longitudinal data.

Parameters:
event_time_name: str

Name of the column in the dataframe that contains the time of the event

event_bool_name: str

Name of the column in the dataframe that indicates whether the event is censored or not

Raises:
LeaspyDataInputError
Parameters:
  • event_time_name (str)

  • event_bool_name (str)

  • nb_events (Optional[int])

tol_diff = 0.001
visit_reader
event_reader
property event_time_name: str

Name of the event time column in dataset

Return type:

str

property event_bool_name: str

Name of the event bool column in dataset

Return type:

str

property dimension: int | None

Number of longitudinal outcomes in dataset.

Return type:

Optional[int]

property long_outcome_names: list[FeatureType]

Name of the longitudinal outcomes in dataset

Return type:

list[FeatureType]

property n_visits: int

Number of visits in the dataset

Return type:

int

class VisitDataframeDataReader[source]

Bases: leaspy.io.data.abstract_dataframe_data_reader.AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for longitudinal data only.

Raises:
LeaspyDataInputError

property dimension: int | None

Number of longitudinal outcomes in dataset.

Returns:
int:

Number of longitudinal outcomes in dataset

Return type:

Optional[int]