Data scarcity is a major bottleneck for applying deep learning in healthcare, particularly for rare diseases, where small patient populations and privacy constraints limit data availability and sharing. To address this challenge, Ruben Branco, a PhD student at LASIGE, published the paper “PatientFlow: Learning to Generate Mixed-Type Longitudinal Clinical Data with Flow Matching” in Artificial Intelligence in Medicine (Elsevier). The paper is co-authored by Sara C. Madeira, a LASIGE integrated member; Marta Gromicho and Mamede de Carvalho from the Faculty of Medicine of the University of Lisbon; and Piero Fariselli from the University of Torino.
The paper introduces PatientFlow, a generative modeling method that combines Variational Autoencoders with Flow Matching to synthesize realistic longitudinal clinical data containing both static patient features (e.g., demographics) and temporal assessments (e.g., clinical scores over time) with mixed data types. The method was extensively evaluated on a cohort of 1,560 patients with Amyotrophic Lateral Sclerosis (ALS), a rare neurodegenerative disease, using a comprehensive evaluation framework that includes statistical testing, a novel similarity metric for mixed-type longitudinal data, semantic rule verification, expert clinical evaluation, and privacy risk assessment.
The results show that prognostic models trained on PatientFlow-generated synthetic data matched or outperformed those trained on real data in four out of five clinically relevant endpoints, while an expert ALS clinician achieved near-chance accuracy (54.2%) when distinguishing synthetic patients from real ones. Privacy risk remained below recommended thresholds. Taken together, these findings indicate that PatientFlow can generate high-fidelity clinical data while limiting privacy risk, opening promising avenues for secure data sharing and dataset augmentation for deep learning in healthcare.
