Next-Generation Multicenter Studies: Using Artificial Intelligence to Automatically Process Unstructured Health Records of Patients with Lung Cancer across Multiple Institutions
Based on data from a French cohort of 311 patients with lung cancer, this study evaluates the accuracy of artificial intelligence for automatically extracting data from unstructured medical records across multiple institutions.
Background: Manual abstraction of real-world data (RWD) from unstructured health records (HRs) remains resource-intensive, error-prone, and highly variable across institutions. Large language models (LLMs) offer a scalable alternative, but their performance in multicenter oncology settings has not been fully validated.
Patients and Methods: We conducted a multicenter study within the French Large & Unified Cancer Cohort (LUCC) consortium to compare the accuracy of artificial intelligence (AI)-based data extraction against manual abstraction by clinical research professionals. A fine-tuned LLM was applied to de-identified unstructured HRs in PDF format to extract 31 variables from patients with lung cancer across 10 centers. Ground truth was defined as concordant values across sources, with discrepant cases adjudicated by a blinded expert. The primary endpoint was the extraction error rate. Secondary endpoints included per-variable performance, inter-institutional variability, F1-score for multiple-choice variables, the added value of hybrid AI–human workflows, and survival analyses.
Results: Among 10,327 patients with AI-based extraction, 311 were included in the test cohort. Across 8,708 datapoints for the 28 variables with a single correct answer, the LLM achieved a 7.0% error rate, outperforming manual abstraction (14.2%, p<0.001). F1-scores for the 3 multiple-choice variables were higher with AI (gene alterations 0.97 vs 0.86, comorbidities 0.86 vs 0.76, metastatic sites 0.71 vs 0.69). Inter-institutional variance was lower with AI (0.12% vs 0.39%). A hybrid approach with targeted human review of the 30% lowest-confidence AI outputs further decreased the error rate to 4.4%. Survival analyses based on AI-extracted data closely matched ground truth, with similar median overall and progression-free survival.
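For multiple-choice variables such as metastatic sites, each patient can have several correct labels at once, so accuracy is summarized with an F1-score rather than a simple error rate. The abstract does not specify the exact averaging scheme; the sketch below illustrates one common choice, micro-averaged F1 over per-patient label sets, with hypothetical example data (the function name and labels are ours, not the study's).

```python
# Illustrative sketch (not the study's code): micro-averaged F1 for
# set-valued ("multiple-choice") variables, where each patient may
# carry several correct labels (e.g. multiple metastatic sites).

def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over per-patient label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly extracted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious labels
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed labels
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical two-patient example of metastatic-site extraction
gold = [{"bone", "liver"}, {"brain"}]
pred = [{"bone"}, {"brain", "adrenal"}]
print(round(micro_f1(gold, pred), 2))  # tp=2, fp=1, fn=1 -> 0.67
```

Micro-averaging pools label counts across patients, so patients with many labels weigh more; a macro-averaged variant (mean of per-patient F1) is an equally plausible reading of the abstract.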
Conclusions: In a multicenter setting, our AI pipeline yielded lower error rates and greater consistency than manual abstraction. These findings support the feasibility of next-generation, AI-enabled multicenter studies to generate high-quality RWD at scale, with potential applicability in prospective clinical trials.
Annals of Oncology, open-access article, 2025