PredictMod ML Pipeline Tutorial

From HIVE Lab

Go Back to PredictMod Project.

Integration of a Machine Learning-based Approach for Predictive Clinical Decision-making using Python

Summary

Part I. Machine Learning Using Python

  1. What is Python?
  2. Objectives
  3. Methodology of the Machine Learning Algorithms
  4. Software Installation
  5. Downloading the Input Files for Synthetic Data Generation
  6. Downloading the Input Files for Model Training

Part II. Using Python scripts to detect signal differences in the Electronic Health Records of responsive and unresponsive patients

  1. Process
  2. Interpreting the Results
  3. Further Analysis and Next Steps

Part I. Machine Learning Using Python

1. What is Python?

Python is a versatile programming language that supports multiple programming paradigms, including procedural, object-oriented, and functional programming. It is widely used for tasks such as data manipulation, web development, scientific computing, and automation. Python’s extensive standard library and external packages make it particularly useful for data analysis, machine learning, and visualization. Through libraries like NumPy, pandas, matplotlib, and scikit-learn, Python excels at handling large datasets, building models, and visualizing results. Additionally, Python can easily interface with programs written in other languages and supports the integration of a wide range of toolkits to extend its functionality.
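
To make this data-analysis workflow concrete, here is a minimal sketch using pandas and scikit-learn; the column names and values are hypothetical, not from the tutorial's dataset:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Build a small, hypothetical patient table with pandas.
    df = pd.DataFrame({
        "bmi": [22.1, 31.4, 27.8, 35.0],
        "glucose": [88, 142, 110, 155],
        "responder": [1, 0, 1, 0],  # hypothetical binary label
    })

    # Fit a simple scikit-learn classifier on the numeric features.
    model = LogisticRegression().fit(df[["bmi", "glucose"]], df["responder"])
    print(model.predict([[25.0, 100]]))  # predict for a new patient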

2. Objectives

The general purpose of this protocol is to provide a proof of concept for a Python workflow that creates predictive machine learning models from data. In this tutorial, patient data serves as the example input to the system, and the output determines whether a patient is a responder or a non-responder to the assigned treatment. The concepts covered here apply to most binary classification datasets for future model development.

Two major machine learning concepts will be applied to this system as follows:

  1. Create a synthetic data set to be used as an input to a machine learning model to ensure consistency during the model training steps
  2. Input patient data through a series of machine learning classification models to predict whether or not a treatment is effective before dietary or medical intervention (e.g. responder vs. non-responder)

This tutorial utilizes patient data provided by Synthea (https://synthea.mitre.org/downloads). The process by which the Synthea data was retrieved and filtered using MATLAB is documented at https://docs.google.com/document/d/1yfUjoaU0lfTx8blTCgZehR7Qdn0C0iTU3VTPAag9ITI/edit?usp=sharing. This tutorial uses its own retrieve-and-filter process written in Python. If interested, the full synthetic generation process can be found at https://github.com/GW-HIVE/PredictMod/tree/main/flask_backend/models/Diabetes_EHR_v1.

3. Methodology of the Machine Learning Algorithms

1. Generating Synthetic Data: To generate synthetic data, the covariance, standard deviation, and mean are calculated for each variable (BMI, glucose, etc.) in the patient dataset, and each variable is designated as continuous or discrete. A "noise" dataset is then generated from these statistics and refined as two neural networks compete: a generator proposes synthetic records while a discriminator judges them against the training set. Once the synthetic data is labeled appropriately, it is stored in a matrix, and this process repeats until a sufficient number of values have been generated. This algorithm is similar to a Generative Adversarial Network (GAN); more background on how GANs work is available at https://en.wikipedia.org/wiki/Generative_adversarial_network. The purpose of the synthetic data generation step in this tutorial is to ensure that the multiple models we apply later can handle the input data: some traditional machine learning models cannot handle NaN/NULL values because of the mathematical operations involved. MATLAB ships with built-in toolkits that account for missing values, but such toolkits take years to develop and Python lacks an equivalent library, so we sidestep the issue by generating synthetic data that contains no missing values.
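
The following is a minimal sketch of the generator/discriminator training loop described above, written with TensorFlow/Keras. It is illustrative only, not the actual Synthetic_EHR_data_diabetes.py script; NOISE_DIM, N_FEATURES, and the layer sizes are placeholder values:

    import tensorflow as tf
    from tensorflow import keras

    NOISE_DIM = 16   # illustrative latent size
    N_FEATURES = 8   # placeholder feature count (BMI, glucose, ...)

    # Generator: maps random noise to a synthetic patient record.
    generator = keras.Sequential([
        keras.Input(shape=(NOISE_DIM,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(N_FEATURES),
    ])

    # Discriminator: scores a record as real vs. synthetic (logit output).
    discriminator = keras.Sequential([
        keras.Input(shape=(N_FEATURES,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1),
    ])

    bce = keras.losses.BinaryCrossentropy(from_logits=True)
    g_opt = keras.optimizers.Adam(1e-3)
    d_opt = keras.optimizers.Adam(1e-3)

    @tf.function
    def train_step(real_batch):
        noise = tf.random.normal([tf.shape(real_batch)[0], NOISE_DIM])
        with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
            fake = generator(noise, training=True)
            real_logits = discriminator(real_batch, training=True)
            fake_logits = discriminator(fake, training=True)
            # Discriminator: label real records 1 and synthetic records 0.
            d_loss = (bce(tf.ones_like(real_logits), real_logits)
                      + bce(tf.zeros_like(fake_logits), fake_logits))
            # Generator: try to make the discriminator call fakes real.
            g_loss = bce(tf.ones_like(fake_logits), fake_logits)
        d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                                  discriminator.trainable_variables))
        g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                                  generator.trainable_variables))
        return d_loss, g_loss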

2. Classifying Training Data: A classification system utilizes a neural network or decision tree to create a binary classifier that predicts whether a patient (a row of the dataset) will be responsive (the label) to the standard Type II diabetes intervention plan. This plan involves non-invasive lifestyle changes such as diet and exercise. There are two identifiable classes for pre-diabetic individuals who follow the intervention plan: responders and non-responders. Responders are individuals who remain at prediabetic levels or return to normal levels, while non-responders are individuals who develop diabetic levels after following the intervention plan. The algorithm is trained on a fraction of the original patient dataset, known as the "training set", and is then evaluated on held-out patient data (the "test set") whose labels it has not seen.
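
As a minimal sketch of this train/test workflow using scikit-learn (the file and column names here are placeholders; the tutorial's actual input files are introduced later):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Placeholder file and column names for illustration.
    df = pd.read_excel("patients.xlsx")
    X = df.drop(columns=["response"])  # features such as BMI, glucose, ...
    y = df["response"]                 # 1 = responder, 0 = non-responder

    # Hold out 20% of the patients as an unseen test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Train on the training set, then predict the held-out patients.
    clf = DecisionTreeClassifier().fit(X_train, y_train)
    print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))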

4. Software Installation

This section provides step-by-step instructions for setting up your environment to run the machine learning algorithms. Follow the instructions below to ensure that the necessary libraries and software are installed correctly.

NOTE: There may be newer versions of the software and libraries than when this tutorial was written. If you run into issues with functions or methods being unavailable, troubleshoot on forums such as Stack Overflow or install the older versions used here.

  1. Install Python: First, ensure that Python is installed on your system. This tutorial uses Python 3.11. You can download Python from the official website: https://www.python.org/downloads/
  2. Install VSCode from the official website: https://code.visualstudio.com/download. Ensure you download the correct version for your operating system (OS).
  3. Configure VSCode to run Python by following the instructions on the official website: https://code.visualstudio.com/docs/python/python-tutorial. It is highly recommended to include Pylance, an extension that works alongside Python in Visual Studio Code to provide performant language support. To add Pylance, open Visual Studio Code, click Extensions on the left-hand side, search for Pylance, and install it.
  4. Use pip to install the Python libraries required for this tutorial. pip is the package installer for Python, used for libraries not included in the standard distribution. Run the following command:
    pip install tensorflow imageio matplotlib numpy pandas scikit-learn
    If Pylance cannot resolve an import of one of the required libraries, check which libraries you have installed by running pip list in the VSCode terminal.
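
To confirm the installation succeeded, you can run a short sanity check like the following (a convenience sketch, not part of the tutorial's scripts):

    # Import each required library and print its version.
    import tensorflow, imageio, matplotlib, numpy, pandas, sklearn

    for lib in (tensorflow, imageio, matplotlib, numpy, pandas, sklearn):
        print(lib.__name__, lib.__version__)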

5. Downloading the Input Files for Synthetic Data Generation

Download the required material for this tutorial from the following link: https://drive.google.com/drive/folders/1U-TIZe-Iqmziijiiw-1VHZNaGhIXUerQ?usp=drive_link. The Synthetic Generation folder contains the Python script and input Excel files required for generating the dataset that we will use.

  • Python Project Materials List:
    • Synthetic_EHR_data_diabetes.py: A Python script that generates synthetic Electronic Health Record (EHR) data for diabetes-related studies, using GAN techniques
  • Excel Data Files:
    • label_non_responsive.xlsx: Contains labels for non-responsive patient data
    • label_responsive.xlsx: Contains labels for responsive patient data
    • data_non_responsive.xlsx: Contains observational data related to patients non-responsive to the treatment
    • data_responsive.xlsx: Contains observational data related to patients responsive to the treatment
    • var_list_.xlsx: A list of variables or features that are present in the dataset. This can be used to identify key variables of interest during analysis.
  • Documentation:
    • README.md: A markdown file providing an overview of the project, explaining the purpose of the scripts, and instructions on how to use the code and data.

6. Downloading the Input Files for Model Training

Download the required material for this tutorial from this link. The model training folder contains the Python script and input Excel files required for testing multiple models and reporting performance metrics.

Part II. Using Python scripts to detect signal differences in the Electronic Health Records of responsive and unresponsive patients

Process

  1. Create a GAN model that takes the input data and generates synthetic data from the same distribution as the initial data, but without NaN/NULL values, to avoid errors in later steps.
  2. Run the multi-model analysis Python script, which tests a wide variety of machine learning models and outputs the accuracy and RMSE (Root Mean Squared Error) for each.

Step 1: Creating Synthetic Data

  1. Open Visual Studio Code (VSCode) and go to the top-left corner, click on File → Open Folder.
  2. Navigate to the folder where you’ve downloaded the Synthetic_EHR_data_diabetes.py script.
  3. Once the folder is open, you should see the file explorer in VSCode.
  4. Double-click on the Synthetic_EHR_data_diabetes.py file to open it in the center of the page.
  5. In the top-right of VSCode, click the Run Python File button (the play icon) to execute the script.
  6. Ensure that you are generating synthetic data for both the responsive and non-responsive datasets. You can achieve this by adjusting the files the script is reading:
    • Change data_responsive.xlsx to data_non_responsive.xlsx.
    • Do the same for label_responsive.xlsx to label_non_responsive.xlsx.
  7. Adjust the output file name at line 167 of the script to differentiate between the responsive and non-responsive datasets:
    • For the responsive dataset: df.to_excel('EHR_responsive_at_epoch_{:04d}.xlsx'.format(epoch))
    • For the non-responsive dataset: df.to_excel('EHR_non_responsive_at_epoch_{:04d}.xlsx'.format(epoch))
  8. After running the script, concatenate the two synthetic files (responsive and non-responsive) and make sure the response column is populated correctly (a sketch of this step follows the list):
    • 1 for the responsive dataset.
    • 0 for the non-responsive dataset.
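
A minimal sketch of this concatenation step using pandas (the epoch-numbered file names and the combined output name are placeholders; match them to the files your run actually produced):

    import pandas as pd

    # Placeholder output file names from Step 1; adjust the epoch numbers.
    resp = pd.read_excel("EHR_responsive_at_epoch_0100.xlsx")
    non_resp = pd.read_excel("EHR_non_responsive_at_epoch_0100.xlsx")

    # Populate the response column: 1 = responsive, 0 = non-responsive.
    resp["response"] = 1
    non_resp["response"] = 0

    # Stack the two synthetic datasets into one combined file.
    combined = pd.concat([resp, non_resp], ignore_index=True)
    combined.to_excel("EHR_combined.xlsx", index=False)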

Step 2: Running the Multi-Model Analysis

The multi-model analysis consists of the following steps:

  1. Load the dataset and extract X (features) and y (labels).
  2. Split the data into training and testing sets.
  3. Define a list of machine learning models to test.
  4. Train each model and evaluate its performance using accuracy and RMSE.
  5. Print the results for each model.

Once you run the script, the performance metrics (accuracy and RMSE) will be printed in the terminal of VSCode. These metrics will provide a baseline to identify the most effective model for further analysis, including parameter and hyperparameter tuning.
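
For reference, here is a condensed sketch of that loop using scikit-learn. It mirrors the models reported below, but the actual multi-model script may differ in details; the input file name is a placeholder:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, mean_squared_error
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                                  ExtraTreesClassifier, AdaBoostClassifier)
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # Placeholder: the concatenated synthetic dataset from Step 1.
    df = pd.read_excel("EHR_combined.xlsx")
    X, y = df.drop(columns=["response"]), df["response"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(),
        "Gradient Boosting": GradientBoostingClassifier(),
        "KNN": KNeighborsClassifier(),
        "SVM": SVC(),
        "Extra Trees": ExtraTreesClassifier(),
        "AdaBoost": AdaBoostClassifier(),
    }

    # Train each model and report accuracy and RMSE on the test set.
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        acc = accuracy_score(y_test, pred)
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        print(f"{name}: accuracy={acc:.3f}, RMSE={rmse:.4f}")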

Interpreting the Results:

1. Logistic Regression:

  • Accuracy: ~89.5%
  • RMSE: Moderate
  • Misclassification: 21 incorrect predictions (11 false positives and 10 false negatives).
  • Next Steps: Consider feature scaling, regularization, and hyperparameter tuning.

2. Decision Tree:

  • Accuracy: 99%
  • RMSE: Low
  • Misclassification: 2 false negatives.
  • Next Steps: Prune the tree to avoid overfitting. Hyperparameter tuning and ensemble techniques like boosting or bagging could help further.

3. Random Forest:

  • Accuracy: 98%
  • RMSE: Low
  • Misclassification: 4 samples misclassified.
  • Next Steps: Hyperparameter tuning (number of trees, maximum depth, etc.) and feature importance analysis.

4. Gradient Boosting:

  • Accuracy: 99.5%
  • RMSE: Lowest among the models.
  • Misclassification: 1 false negative.
  • Next Steps: Tune learning rate, number of boosting stages, and depth. Dimensionality reduction may also help.

5. K-Nearest Neighbors (KNN):

  • Accuracy: 94.5%
  • RMSE: 0.2345 (slightly higher compared to tree-based methods).
  • Misclassification: 11 false negatives.
  • Next Steps: Scale the data, tune the number of neighbors (k), and explore distance metrics.

6. Support Vector Machine (SVM):

  • Accuracy: 81.5%
  • RMSE: 0.4301 (highest RMSE).
  • Misclassification: Significant number of false negatives (37).
  • Next Steps: Tune kernel type, C parameter, and gamma. Try dimensionality reduction techniques.

7. Extra Trees:

  • Accuracy: 97.5%
  • RMSE: 0.1581.
  • Misclassification: 5 false negatives.
  • Next Steps: Hyperparameter tuning (number of trees, depth), ensemble techniques, or feature selection.

8. AdaBoost:

  • Accuracy: 98.5%
  • RMSE: 0.1225 (low RMSE).
  • Misclassification: Only 3 misclassified samples.
  • Next Steps: Tune learning rate and number of estimators. Consider ensemble techniques or cross-validation.
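
The misclassification counts above (false positives and false negatives) come from a confusion matrix. A minimal, self-contained sketch with toy labels:

    from sklearn.metrics import confusion_matrix

    # Toy labels for illustration (1 = responder, 0 = non-responder).
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    # For binary labels, ravel() returns (tn, fp, fn, tp).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"true neg={tn}, false pos={fp}, false neg={fn}, true pos={tp}")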

Further Analysis and Next Steps:

While most models performed exceptionally well, additional techniques could further enhance results:

  1. Hyperparameter Tuning: Fine-tuning model parameters using techniques like Grid Search or Random Search will likely yield improvements. Parameters such as learning rate, depth, number of estimators, and regularization strength could be optimized (a combined tuning and cross-validation sketch follows this list).
  2. Cross-Validation: Apply k-fold cross-validation for a more reliable estimate of model performance and to avoid overfitting on the test set.
  3. Feature Selection and Dimensionality Reduction: Implement Principal Component Analysis (PCA) or feature selection methods to reduce noise, improve computation efficiency, and enhance predictive power, especially for models like KNN and SVM.
  4. Handling Class Imbalance: Techniques like SMOTE or class weighting could be useful if class imbalance is present in the dataset.
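
As an illustration of items 1 and 2 combined, here is a minimal grid search with 5-fold cross-validation over the Gradient Boosting model (the parameter values and file name are placeholders, not tuned recommendations):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    # Placeholder: the concatenated synthetic dataset from Step 1.
    df = pd.read_excel("EHR_combined.xlsx")
    X, y = df.drop(columns=["response"]), df["response"]

    # Illustrative search space for Gradient Boosting.
    param_grid = {
        "learning_rate": [0.01, 0.1, 0.2],
        "n_estimators": [100, 200],
        "max_depth": [2, 3, 4],
    }

    # 5-fold cross-validated grid search scored on accuracy.
    search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                          cv=5, scoring="accuracy")
    search.fit(X, y)
    print("Best parameters:", search.best_params_)
    print("Best cross-validated accuracy:", search.best_score_)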

By applying these further techniques, we can continue to refine the model performance, increase predictive accuracy, and reduce error metrics across the board.
