Model Training and Validation
Latest revision as of 18:42, 28 August 2025
Supported Technologies and Algorithms
The initial cohort of models submitted with the initial release of the PredictMod platform was built using a wide variety of techniques spanning every stage from data ingestion through modeling. These techniques included: K-nearest neighbors, the Synthetic Minority Oversampling Technique (SMOTE), and Leave-One-Out Cross-Validation (LOOCV) for data sampling and harmonization; conditional Generative Adversarial Networks (cGANs), SMOTE, multivariate kernel density estimation (KDE), Emergent Self-Organizing Maps (ESOMs), and sampling directly from synthetic databases, as in MDClone, for augmenting original data with synthetic data; and confusion matrices, feature importances, and Principal Component Analysis (PCA) for model training and validation. This list is not exhaustive, only representative of the techniques used across the spectrum of models included with the PredictMod platform release. At present, all Python-based models are supported. In future development, we aim to support any technology or algorithm made freely available to us. Additional information about each existing model can be found in the BCO documentation on the platform and in the associated GitHub repository.
Methods for Data Ingestion
Data Dictionaries and Mapping
Proper data ingestion practices are crucial to ensure accurate and comprehensive data collection. This was achieved by leveraging data dictionaries and mapping files for each curated dataset. The data dictionaries developed for PredictMod use Logical Observation Identifiers Names and Codes (LOINC) as their primary ontological identifier, given its wide use for health measurements, alongside International Classification of Diseases, 10th Revision (ICD-10), Systematized Nomenclature of Medicine (SNOMED), and Current Procedural Terminology (CPT) codes. Recognizing that multiple terminologies can describe a single measurement, we conducted thorough research and collaborated closely with clinical professionals to identify the most appropriate code for each measurement. The data mapping file maintains consistency by clearly defining relationships between data elements from different sources or formats, ensuring accurate integration and interpretation across datasets. Leveraging data dictionaries and data mapping was particularly beneficial as we harmonized information from multiple resources, including omics datasets, EMR platforms, scientific literature, and synthetic patient databases.
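The dictionary-plus-mapping approach described above can be sketched in Python as follows. The LOINC codes shown (4548-4 for HbA1c, 39156-5 for BMI) are real identifiers, but the dictionary structure, source names, and field mappings are hypothetical illustrations, not PredictMod's actual data dictionary.

```python
# Illustrative sketch: harmonizing records from two sources via a shared
# data dictionary keyed on LOINC codes. Structure and names are assumptions.

DATA_DICTIONARY = {
    "4548-4": {"name": "hemoglobin_a1c", "unit": "%"},       # HbA1c
    "39156-5": {"name": "bmi", "unit": "kg/m2"},             # Body Mass Index
}

# Source-specific field names mapped onto the canonical LOINC code.
SOURCE_MAPPINGS = {
    "emr_export": {"HbA1c_pct": "4548-4", "BodyMassIndex": "39156-5"},
    "omics_csv": {"a1c": "4548-4", "bmi_value": "39156-5"},
}

def harmonize(record: dict, source: str) -> dict:
    """Translate source-specific column names into canonical names."""
    mapping = SOURCE_MAPPINGS[source]
    out = {}
    for field, value in record.items():
        if field in mapping:
            loinc = mapping[field]
            out[DATA_DICTIONARY[loinc]["name"]] = value
    return out

print(harmonize({"HbA1c_pct": 6.1, "BodyMassIndex": 31.2}, "emr_export"))
# {'hemoglobin_a1c': 6.1, 'bmi': 31.2}
```

Because every source column resolves to a single LOINC code before renaming, two sources that spell the same measurement differently land on the same canonical field.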
Data Quality Control
A critical element of our data ingestion protocol is quality control (QC) of the raw data. This step improves the reliability of model training by eliminating errors, reducing noise, and ensuring the accuracy and consistency of the dataset. Data QC typically begins with assessing null values, evaluating class imbalance, and identifying extraneous variables. In addition to these automated checks, collaboration with medical experts was essential in deciding whether to exclude certain extraneous variables. QC procedures also emphasize the removal of censored data to reduce noise and verify that data types, units, and value ranges align with the standards defined in the data dictionary and mapping documents.
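The first three QC steps named above (null counts, class balance, range checks) can be sketched as a single report function. The plausibility bounds and field names below are assumptions for illustration, not PredictMod's actual QC rules.

```python
# Minimal QC sketch: null counts, class balance, and out-of-range values.
from collections import Counter

# Assumed plausibility bounds; real bounds come from the data dictionary.
EXPECTED_RANGES = {"bmi": (10.0, 80.0), "hba1c": (3.0, 20.0)}

def qc_report(rows, label_key="responder"):
    """Summarize nulls, out-of-range values, and class counts for a dataset."""
    nulls, out_of_range, labels = Counter(), Counter(), Counter()
    for row in rows:
        labels[row.get(label_key)] += 1
        for field, (lo, hi) in EXPECTED_RANGES.items():
            value = row.get(field)
            if value is None:
                nulls[field] += 1
            elif not lo <= value <= hi:
                out_of_range[field] += 1
    return {"nulls": dict(nulls), "out_of_range": dict(out_of_range),
            "class_counts": dict(labels)}

rows = [
    {"bmi": 31.2, "hba1c": 6.1, "responder": 1},
    {"bmi": None, "hba1c": 5.9, "responder": 0},
    {"bmi": 205.0, "hba1c": 6.4, "responder": 0},  # likely unit error
]
print(qc_report(rows))
```

A report like this surfaces the class imbalance and suspect entries early, before any decision about excluding variables is put to the medical experts.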
Use of MDClone synthetic data source
The MDClone ADAMS Platform is a self-service data analytics platform that collaborates with the Department of Veterans Affairs (VA) and enables healthcare collaboration, research, and innovation. It provides synthetic datasets that are based on real patient data. Synthetic data comprises a novel population with statistical characteristics and intervariable correlations similar to those of the source population, without including any original patient data points. This allows data to be transferred internally and externally without compromising sensitive personally identifiable information (PII). The synthetic dataset is used to train advanced algorithms within the PredictMod application, enabling predictions of treatment outcomes for patients with preDM or T2DM in response to interventions that typically include lifestyle changes or medication. All synthetic datasets are statistically similar to the source VA patient datasets housed in the MDClone ADAMS Platform, and a comparison report is carefully reviewed as part of the data harmonization process.
The process of query building involves creating a cohort of interest. The MDClone Diet Counseling cohort included patients at least 18 years old with a preDM diagnosis and a Body Mass Index (BMI) at or above 30. For individuals who satisfied these criteria and had undergone drug or lifestyle intervention, variables of interest, including weight, BMI, blood pressure, and Hemoglobin A1c (HbA1c), collected less than six months prior to intervention were added to the queries and extracted. If multiple lab results within the six-month window were available, the most recent one prior to the start of the intervention was used. Variables of interest were identified and incorporated through targeted keyword searches and LOINC code mapping to enhance data capture. A parameter was also set to extract patients' weight 9–12 months following intervention and to identify those who experienced a 5% reduction in weight from baseline ("Responders").
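The two selection rules above, taking the most recent lab within the six-month pre-intervention window and labeling a 5% weight reduction as "Responder", can be sketched as follows. Field names are hypothetical, and the sketch assumes the 5% threshold is inclusive.

```python
# Sketch of the cohort-query rules described above; names and the
# inclusive threshold are assumptions for illustration.

def most_recent_pre_intervention(labs, window_days=180):
    """labs: list of (days_before_intervention, value) pairs.
    Return the most recent value within the window, i.e. the one with
    the smallest positive offset, or None if nothing qualifies."""
    eligible = [(d, v) for d, v in labs if 0 < d <= window_days]
    return min(eligible, key=lambda t: t[0])[1] if eligible else None

def label_responder(baseline_weight_kg, followup_weight_kg, threshold=0.05):
    """True if the 9-12 month follow-up weight shows at least a
    `threshold` fractional drop from the baseline weight."""
    return (baseline_weight_kg - followup_weight_kg) / baseline_weight_kg >= threshold

print(most_recent_pre_intervention([(200, 6.5), (30, 6.1)]))  # 6.1
print(label_responder(100.0, 94.0))  # True: 6% reduction
print(label_responder(100.0, 97.0))  # False: 3% reduction
```

Note that the 200-day-old result is discarded even though it exists, matching the "less than six months prior to intervention" rule.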
Methods for Model Training and Validation
Our model training workflow involves using refined EMR and omics datasets to uncover predictive signals indicative of intervention outcomes before those interventions commence. An initial training phase using a stratified data split is conducted to observe each model's decision-making patterns before additional optimization or tuning. If needed, hyperparameter tuning is then conducted to guard against overfitting and to optimize generalizability. Model QC involves validating each model against the current literature with guidance from medical and ML experts. Techniques such as feature importance analysis and other explainable-AI metrics are also employed to assess how well each model's output decisions align with current scientific understanding. After training, validation, and thorough documentation of the model pipeline, the models are integrated into the PredictMod platform. The corresponding READMEs and BCOs are then presented directly on the "View" page for each model and made available in the PredictMod GitHub repository.
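The stratified split used in the initial training phase can be illustrated with a small dependency-free sketch (in practice a library routine such as scikit-learn's train_test_split with stratification would typically be used; the fractions and seed below are arbitrary):

```python
# Pure-Python stratified train/test split: sample the test fraction
# from each class separately so class proportions are preserved.
import random
from collections import defaultdict

def stratified_split(rows, label_key, test_frac=0.2, seed=42):
    """Split rows into (train, test) while preserving class proportions."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)                    # randomize within each class
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

rows = [{"y": i % 2, "x": i} for i in range(100)]  # balanced toy data
train, test = stratified_split(rows, "y")
print(len(train), len(test))  # 80 20
```

Because each class contributes its own 20%, a rare outcome class cannot be accidentally excluded from the test set, which is the point of stratifying before any tuning.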
For example, for our "prediabetes proteomics" model, we employed a logistic regression classifier (C=100, penalty='l1', solver='liblinear', max_iter=1000, class_weight='balanced') and validated it using a 20% test split. Model performance was assessed through accuracy, F1 score, recall, precision, and area under the ROC curve (AUC), with particular attention paid to avoiding overfitting. Initial experiments with oversampling via SMOTE led to unstable results, producing variable F1 scores between 0.83 and 0.92 on the test set and AUC values ranging from 0.86 to 0.93, likely reflecting the stochastic nature of SMOTE's synthetic sample generation. Similar variability in model performance following oversampling with SMOTE and related methods has also been reported in prior studies. In contrast, augmentation through Gaussian noise injection has been shown to be better suited for certain datasets, yielding improved stability and reproducibility of model performance. Consistent with these findings, we observed that Gaussian noise augmentation provided the most reliable results in our case, with training and testing F1 scores of 1.0 and 0.92, respectively, and a consistent AUC of 0.92. Data preprocessing included normalization, harmonization of input features, and quality checks to flag low-quality or inconsistent entries. The model's feature importance aligned closely with previously published findings, particularly highlighting proteins such as SELE, MICB/MICA, LPL, CLEC4AC, IL17F, ICAM5, ACE2, NRP1, LGALS4, and IL3RA, many of which are implicated in glycosylation and immune regulation. The prominence of these proteins is biologically plausible, as the cohort analyzed consisted of prediabetic patients undergoing exercise interventions aimed at improving metabolic and immune function.
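The Gaussian noise augmentation mentioned above amounts to extending the training set with copies of each sample perturbed by zero-mean Gaussian noise. The sketch below illustrates the idea; the noise scale (sigma), copy count, and seed are illustrative assumptions, not the settings used for the prediabetes proteomics model.

```python
# Gaussian noise augmentation sketch: each synthetic sample is an original
# feature vector plus zero-mean Gaussian noise; the label is kept unchanged
# on the assumption that small perturbations do not cross class boundaries.
import random

def augment_gaussian(features, labels, n_copies=1, sigma=0.05, seed=0):
    """Return (features, labels) extended with noisy copies of each sample."""
    rng = random.Random(seed)       # fixed seed -> reproducible augmentation
    aug_x = list(features)          # originals are kept first
    aug_y = list(labels)
    for _ in range(n_copies):
        for x, y in zip(features, labels):
            aug_x.append([v + rng.gauss(0.0, sigma) for v in x])
            aug_y.append(y)
    return aug_x, aug_y

x = [[0.1, 0.9], [0.8, 0.2]]
y = [0, 1]
ax, ay = augment_gaussian(x, y, n_copies=2)
print(len(ax), ay)  # 6 [0, 1, 0, 1, 0, 1]
```

Unlike SMOTE, which interpolates between randomly chosen minority-class neighbors, this scheme perturbs every sample with a fixed-seed noise source, which is one plausible reason the resulting performance metrics were more reproducible across runs.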