4. Algorithms

4.1. Fit

In this section we describe how to fit a Leaspy model to your data. Leaspy uses the MCMC-SAEM algorithm to fit a model by jointly estimating the fixed effects and the distribution of the random effects. It is particularly well suited to models of this kind, where the likelihood involves latent variables and is not available in closed form.

The algorithm is an adaptation of the Expectation-Maximization (EM) algorithm that relies on an iterative procedure alternating between the following main steps:

  • Expectation/Stochastic Approximation Step: the algorithm uses Markov Chain Monte Carlo (MCMC) to generate samples of the latent variables (random effects) conditional on the current parameter estimates. In particular, Gibbs sampling is employed, which iteratively updates each latent variable conditional on the current values of the others, allowing efficient exploration of the latent space. To avoid convergence to local maxima, a temperature scheme is applied: the sampling distribution is initially “flattened” during the burn-in phase, to allow exploration of a wider range of values, and the temperature is gradually reduced over iterations so that the chain focuses increasingly on high-likelihood regions. The sufficient statistics of the complete-data log-likelihood are then computed using a stochastic approximation scheme.

  • Maximization Step: Given the updated sufficient statistics, the fixed effects and variance components are re-estimated by maximizing the approximate complete-data log-likelihood.

By iterating these steps, the MCMC-SAEM algorithm converges to the maximum likelihood estimates of the model parameters.
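
To make the structure of the loop concrete, the following self-contained toy example runs MCMC-SAEM on a deliberately simple random-effects model (estimating a population mean). It only illustrates the algorithmic skeleton described above, not Leaspy's implementation; all names and values are ours:

import numpy as np

rng = np.random.default_rng(42)

# Toy latent-variable model: y_ij ~ N(z_i, s2) with z_i ~ N(mu, v).
# We estimate the fixed effect mu; the z_i play the role of random effects.
n_ind, n_obs, s2, v, true_mu = 50, 5, 1.0, 1.0, 3.0
y = (rng.normal(true_mu, np.sqrt(v), size=(n_ind, 1))
     + rng.normal(0.0, np.sqrt(s2), size=(n_ind, n_obs)))

def log_post(zz, mu):
    # log p(y_i | z_i) + log p(z_i | mu) for each individual, up to a constant
    return (-0.5 * ((y - zz[:, None]) ** 2).sum(axis=1) / s2
            - 0.5 * (zz - mu) ** 2 / v)

mu, z, S = 0.0, np.zeros(n_ind), 0.0
n_iter, n_burn_in = 500, 250
for k in range(n_iter):
    # S-step: one Metropolis update per individual, tempered during burn-in
    # (temperature > 1 flattens the target and widens the exploration).
    temperature = 1.0 + 9.0 * max(0.0, 1.0 - k / n_burn_in)
    z_prop = z + rng.normal(0.0, 0.5, size=n_ind)
    accept = (np.log(rng.uniform(size=n_ind))
              < (log_post(z_prop, mu) - log_post(z, mu)) / temperature)
    z[accept] = z_prop[accept]

    # Stochastic approximation of the sufficient statistic (here, the mean of z):
    # constant step size during burn-in, then decreasing towards zero.
    eps = 1.0 if k < n_burn_in else 1.0 / (k - n_burn_in + 1)
    S += eps * (z.mean() - S)

    # M-step: closed-form update of the fixed effect given the statistic.
    mu = S

print(f"Estimated mu: {mu:.2f} (true value: {true_mu})")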

4.1.1. Prerequisites

Depending on the model you want to fit, you need a dataframe with a specific structure (see logistic, joint, and mixture models).
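
For instance, the logistic model expects longitudinal data in long format: one row per visit, identified by ID and TIME, with one column per outcome scaled to [0, 1]. A minimal sketch follows (the outcome names are placeholders, and the Data.from_dataframe wrapper mirrors the 1.x API, so check the data-loading documentation of your Leaspy version):

import pandas as pd

# One row per visit; outcomes are scaled to [0, 1] for the logistic model.
# "OUTCOME_1" and "OUTCOME_2" stand in for your own markers.
df = pd.DataFrame(
    {
        "ID": ["GS-001", "GS-001", "GS-001", "GS-002", "GS-002"],
        "TIME": [70.1, 71.0, 72.4, 65.3, 66.2],
        "OUTCOME_1": [0.12, 0.18, 0.25, 0.05, 0.10],
        "OUTCOME_2": [0.30, 0.32, 0.41, 0.20, 0.22],
    }
)

# Wrap the dataframe into a Leaspy Data object (import path may differ by version).
from leaspy import Data
data_leaspy = Data.from_dataframe(df)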

4.1.2. Running the Task

First, choose one of the existing models in leaspy.models. The model you select determines the expected structure of your dataset (see Prerequisites).

Let’s use the logistic model as an example.

from leaspy.models import LogisticModel

We need to specify the arguments name, dimension (the number of outcomes \(K\) in your dataset) and obs_models (valid choices for the logistic model are ‘gaussian-diagonal’, which estimates one noise coefficient per outcome, and ‘gaussian-scalar’, which estimates a single noise coefficient shared by all outcomes). When fitting a multivariate model we also need to specify source_dimension, which corresponds to the degrees of freedom of the intermarker-spacing parameters; we refer you to the mathematical background section for more details. We generally suggest a number of sources close to the square root of the number of outcomes (\(\sqrt{K}\)).

You can also set a seed or control other arguments for the output and the logs, such as save_periodicity, path, etc.

# data_leaspy is the Leaspy Data object built from your dataframe (see Prerequisites)
model = LogisticModel(
    name="my-model",
    dimension=2,
    source_dimension=1,
    obs_models='gaussian-diagonal',
)
model.fit(data_leaspy, "mcmc_saem", n_iter=20000, seed=42)

Note that the joint and mixture models require additional model-specific arguments. Please refer to their respective documentation for details: joint model and mixture model.

4.1.3. Output

Once the iterations are complete, we can inspect the parameters estimated by the model and save them to a dedicated file.

model.parameters
model.save('my_path/my_model/model_parameters.json')

We can also plot the estimated average trajectory.

import matplotlib.pyplot as plt
from leaspy.io.logs.visualization.plotting import Plotting
leaspy_plotting = Plotting(model)

ax = leaspy_plotting.average_trajectory(
    alpha=1, figsize=(14, 6), n_std_left=2, n_std_right=8
)
plt.show()

4.2. Personalize

After fitting the model, Leaspy provides a personalization step in which random effects are estimated for each subject, conditional on the previously learned fixed effects. This section explains how these random effects can be obtained from the available longitudinal observations of each subject. By estimating these individual parameters, Leaspy characterizes each patient’s trajectory and enables predictions of their future progression. Leaspy provides two main approaches to perform this personalization step.

  • A more frequentist approach: random effects are estimated with the minimize solver from the SciPy package [VGO+20], by maximizing the likelihood of each subject's observations given the fixed effects (a self-contained toy sketch of this per-subject optimization is given after the examples below).

>>> personalize_settings = AlgorithmSettings("scipy_minimize", seed=0)
>>> algorithm = algorithm_factory(personalize_settings)
>>> individual_parameters = algorithm.run(model, dataset)
==> Setting seed to 0
|##################################################|   200/200 subjects
Personalize with `AlgorithmName.PERSONALIZE_SCIPY_MINIMIZE` took: 29s
The standard-deviation of the noise at the end of the AlgorithmType.PERSONALIZE is: 6.85%
>>> individual_parameters.to_dataframe()
        sources_0  sources_1        tau        xi
ID
GS-001   0.027233  -0.423354  76.399582 -0.386364
GS-002  -0.262680   0.665351  75.137474 -0.577525
GS-003   0.409585   0.844824  75.840012  0.102436
GS-004   0.195366   0.056577  78.110085  0.433171
GS-005   1.424637   1.054663  84.183098 -0.051317
...           ...        ...        ...       ...
GS-196   0.961941   0.389468  72.528786  0.354126
GS-197  -0.561685  -0.720041  79.006042 -0.624100
GS-198  -0.061995   0.177671  83.596138  0.201686
GS-199   1.932454   1.820023  92.275978 -0.136224
GS-200   1.152407  -0.171888  76.504517  0.770118

[200 rows x 4 columns]
  • A more Bayesian approach: random effects are estimated using a Gibbs sampler, with options for the burn-in phase and the temperature scheme (see the fit description). Currently, the package allows extracting either the mean or the mode of the posterior distribution; both are used with the same procedure, via the mean_posterior or mode_posterior flag.

>>> personalize_settings = AlgorithmSettings("mean_posterior", seed=0)
>>> algorithm = algorithm_factory(personalize_settings)
>>> individual_parameters = algorithm.run(model, dataset)
==> Setting seed to 0
|##################################################|   1000/1000 iterations

Personalize with `mean_posterior` took: 1s
>>> individual_parameters.to_dataframe()
        sources_0  sources_1        tau        xi
ID
GS-001   0.027233  -0.423354  76.399582 -0.386364
GS-002  -0.262680   0.665351  75.137474 -0.577525
GS-003   0.409585   0.844824  75.840012  0.102436
GS-004   0.195366   0.056577  78.110085  0.433171
GS-005   1.424637   1.054663  84.183098 -0.051317
...           ...        ...        ...       ...
GS-196   0.961941   0.389468  72.528786  0.354126
GS-197  -0.561685  -0.720041  79.006042 -0.624100
GS-198  -0.061995   0.177671  83.596138  0.201686
GS-199   1.932454   1.820023  92.275978 -0.136224
GS-200   1.152407  -0.171888  76.504517  0.770118

[200 rows x 4 columns]
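
To give some intuition for what scipy_minimize does for each subject, here is a self-contained toy sketch: a one-dimensional logistic curve with random effects \(\tau\) (onset age) and \(\xi\) (log-speed), fitted to one subject's observations by minimizing a MAP-style criterion, i.e. the negative log-likelihood plus Gaussian priors on the random effects. The model, data and prior values are made up for illustration and do not mirror Leaspy's internals:

import numpy as np
from scipy.optimize import minimize

# One subject's observed ages and outcome values (made-up data in [0, 1]).
t = np.array([70.0, 72.0, 74.0, 76.0])
y = np.array([0.15, 0.25, 0.45, 0.60])

# Hypothetical fixed effects learned during the fit step.
tau_mean, tau_std, xi_std, noise_std = 75.0, 7.0, 0.5, 0.05

def neg_log_posterior(params):
    tau, xi = params
    pred = 1.0 / (1.0 + np.exp(-np.exp(xi) * (t - tau)))  # logistic trajectory
    # Gaussian log-likelihood of the observations + log-priors on (tau, xi)
    return (0.5 * np.sum((y - pred) ** 2) / noise_std**2
            + 0.5 * ((tau - tau_mean) / tau_std) ** 2
            + 0.5 * (xi / xi_std) ** 2)

result = minimize(neg_log_posterior, x0=np.array([tau_mean, 0.0]), method="Nelder-Mead")
print(f"tau = {result.x[0]:.2f}, xi = {result.x[1]:.2f}")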

4.3. Estimate

This section describes the procedure for estimating a patient’s trajectory. Once personalization has been performed, we can estimate and visualize the trajectory of a specific subject using their individual parameters.

import numpy as np

# "dataframe" is the original long-format data indexed by ID and TIME, and "ip"
# holds the individual parameters obtained during the personalization step.
observations = dataframe.loc["GS-187"]
print(f"Seen ages: {observations.index.values}")
print("Individual parameters: ", ip["GS-187"])

timepoints = np.linspace(60, 100, 100)
reconstruction = model.estimate({"GS-187": timepoints}, ip)

The reconstruction object contains the estimated outcome values for the individual, which we can plot along with the actual observations.

ax = leaspy_plotting.patient_trajectories(
    data_leaspy,
    ip,
    patients_idx=["GS-187"],
    labels=["OUTCOME_1", "OUTCOME_2"],
    alpha=1,
    linestyle="-",
    linewidth=2,
    markersize=8,
    obs_alpha=0.5,
    figsize=(16, 6),
    factor_past=0.5,
    factor_future=5,
)
ax.set_xlim(45, 80)
plt.show()

4.4. Simulate

This section describes the procedure for simulating new patient data under the Spatiotemporal model structure. The simulation method, relying on a fitted Leaspy model and user-defined parameters, involves the following steps:

Step 1: Generation of Individual Parameters
For each simulated patient, individual parameters (\(\tau_i\), \(\xi_i\), and the sources) are sampled from normal distributions defined by the model’s means and standard deviations:

  • \(\xi_i \sim \mathcal{N}\left(0, \sigma^2_{\xi}\right)\),

  • \(\tau_i \sim \mathcal{N}\left(t_0, \sigma^2_{\tau}\right)\),

  • \(s_{i,m} \sim \mathcal{N}\left(\bar{s}, \sigma^2_s\right)\), for each of the \(N_s\) sources, \(m = 1, \dots, N_s\).

These model parameters come from a previously fitted Leaspy model, provided by the user.
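
Concretely, step 1 amounts to independent Gaussian draws; a minimal numpy sketch follows, with hypothetical values standing in for the parameters read from the fitted model:

import numpy as np

rng = np.random.default_rng(0)
n_patients, n_sources = 200, 2

# Hypothetical population parameters, as read from a fitted Leaspy model.
t0, sigma_tau, sigma_xi = 76.0, 7.0, 0.5
s_bar, sigma_s = 0.0, 1.0

xi = rng.normal(0.0, sigma_xi, size=n_patients)   # individual log-speed factors
tau = rng.normal(t0, sigma_tau, size=n_patients)  # individual onset ages
sources = rng.normal(s_bar, sigma_s, size=(n_patients, n_sources))  # space shifts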

Step 2: Generation of Visit Times
Visit times are generated based on user-specified visit parameters, such as the number of visits, spacing between visits, and follow-up duration. These parameters are provided through a dictionary.
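
For the "random" visit type, one plausible reading of these parameters is the following sketch, which uses the dictionary keys from the example in section 4.4.2 (our interpretation; see Leaspy's docstring for the exact semantics):

import numpy as np

rng = np.random.default_rng(0)
p = {  # same keys as the visits_params dictionary of section 4.4.2
    "first_visit_mean": 0.0, "first_visit_std": 0.4,
    "time_follow_up_mean": 11, "time_follow_up_std": 0.5,
    "distance_visit_mean": 2 / 12, "distance_visit_std": 0.75 / 12,
    "distance_visit_min": 1 / 365,
}

# Sample the first visit and the follow-up duration, then accumulate
# inter-visit gaps, floored at the minimum allowed distance (all in years).
first_visit = rng.normal(p["first_visit_mean"], p["first_visit_std"])
follow_up = rng.normal(p["time_follow_up_mean"], p["time_follow_up_std"])
visits = [first_visit]
while visits[-1] - first_visit < follow_up:
    gap = max(rng.normal(p["distance_visit_mean"], p["distance_visit_std"]),
              p["distance_visit_min"])
    visits.append(visits[-1] + gap)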

Step 3: Estimation of Observations
The estimate function from Leaspy is used to compute each patient’s observations at the generated visit times \(t_{i,j,k}\), based on the individual parameters:
\(y_{i,j,k} = \gamma_{i,k}(t_{i,j,k}) + \epsilon_{i,j,k}, \quad \epsilon_{i,j,k} \sim \mathcal{N}(0, \sigma^2_k)\)

To reflect variability in the observations, beta-distributed noise is added, which is appropriate for modeling bounded outcomes in a logistic framework (a sketch of the idea follows).
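
A minimal sketch of this beta re-sampling idea, assuming a standard mean/concentration parameterization (Leaspy's exact noise scheme may differ): each simulated value is drawn from a Beta distribution whose mean equals the noiseless prediction \(\gamma_{i,k}(t_{i,j,k})\).

import numpy as np

rng = np.random.default_rng(0)

# Noiseless predictions in (0, 1), e.g. the output of model.estimate for one patient.
gamma = np.array([0.12, 0.18, 0.25, 0.33])

# Beta re-sampling with mean gamma: Beta(gamma * kappa, (1 - gamma) * kappa) has
# mean gamma, and larger concentrations kappa yield smaller noise levels.
kappa = 100.0
y = rng.beta(gamma * kappa, (1.0 - gamma) * kappa)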

4.4.1. Prerequisites

To run a simulation, the following variables are required:

  • A fitted Leaspy model (see the fit function), used for both parameter sampling (step 1) and the estimate function (step 3).

  • A dictionary of visit parameters, specifying the number, type, and timing of visits (used in step 2).

  • An AlgorithmSettings object, configured for simulation and including:

    • The name of the outcomes to simulate.

    • The visit parameter dictionary.

4.4.2. Running the Task

>>> from leaspy.algo import AlgorithmSettings
>>> visits_params = {
        'patient_nb': 200,
        'visit_type': "random",
        'first_visit_mean': 0.,
        'first_visit_std': 0.4,
        'time_follow_up_mean': 11,
        'time_follow_up_std': 0.5,
        'distance_visit_mean': 2 / 12,
        'distance_visit_std': 0.75 / 12,
        'distance_visit_min': 1 / 365
    }
>>> simulated_data = model.simulate(
        algorithm="simulate",
        outcomes=["MDS1_total", "MDS2_total", "MDS3_off_total"],
        visit_parameters=visits_params
    )
>>> print(simulated_data.data.to_dataframe().set_index(['ID', 'TIME']).head())
 ID  TIME  MDS1_total  MDS2_total  MDS3_off_total  SCOPA_total  MOCA_total  \
  0  63.0    0.130888    0.220548        0.186086     0.083651    0.088756   
     64.0    0.138080    0.039211        0.289588     0.034846    0.047147   
     65.0    0.228149    0.068744        0.151979     0.141604    0.131976   
     66.0    0.208679    0.112899        0.202224     0.192716    0.067183   
     67.0    0.290484    0.252141        0.255622     0.240425    0.115898   

   REM_total  PUTAMEN_R  PUTAMEN_L  CAUDATE_R  CAUDATE_L  
   0.555283   0.808789   0.685063   0.546174   0.467885  
   0.660931   0.758014   0.640209   0.541839   0.474202  
   0.766028   0.941519   0.738120   0.643509   0.549832  
   0.671021   0.796510   0.930209   0.657473   0.622322  
   0.791594   0.955246   0.844813   0.677306   0.638281  

4.4.3. Output

The output is a Data object with ID, TIME and simulated values of each outcome.

4.4.4. Setting options

There are three options for simulating the visit times in Leaspy, which can be specified through the visit_type key of the visit parameters dictionary:

  • random: Visit times and intervals are sampled from normal distributions.

  • regular: Visits occur at regular intervals, defined by regular_visit.

  • dataframe: Custom visit times are provided directly via a DataFrame.

Refer to the docstring for further details.
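
For instance, a "regular" scheme could be specified as follows (illustrative: the key names are inferred from the text above, so check the docstring for the exact keys expected by your Leaspy version):

visits_params = {
    'patient_nb': 200,
    'visit_type': "regular",
    'first_visit_mean': 0.,
    'first_visit_std': 0.4,
    'time_follow_up_mean': 11,
    'time_follow_up_std': 0.5,
    'regular_visit': 1.0,  # fixed spacing between consecutive visits
}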

4.5. References

[VGO+20]

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2.