Modeling Tutorials

Step-by-Step Tutorials

The following tutorials are recommended for those interested in creating and submitting models to the PredictMod platform.

EHR ML Tutorial

Metagenomic ML Tutorial

Recommended Pipeline Steps

Please use the existing tutorials on the wiki and supplement your pipeline with the information below. Please let us know which parts of this content, if any, are more useful than what is online so that we can revise the tutorials accordingly.

Environment Setup

Install VS Code, Miniconda, and Python (3.11).
Make sure the relevant packages are installed and can be imported:
1. REQUIRED: scikit-learn, matplotlib, numpy, pandas, shap, xgboost, seaborn, and joblib.
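
A quick way to confirm the environment is ready is to check that each required package can be found before running the pipeline. The sketch below uses only the Python standard library; the helper name find_missing is hypothetical, and note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names for the required packages (scikit-learn is imported as "sklearn").
REQUIRED = ["sklearn", "matplotlib", "numpy", "pandas", "shap", "xgboost", "seaborn", "joblib"]


def find_missing(modules):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any package reported as missing can then be installed into the active conda environment before continuing.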

Data Preparation

Import the dataset. The ML-ready dataset should:
1. Include all required variables for model training
2. Include a clear R/NR column
Run initial descriptive statistics
Split the data into train/test
Handle missing data (if needed)
1. Simple imputation (median/mode) should be sufficient for first-time modelers. KNN imputation is recommended but more complex.
Conduct feature selection
1. Select a method (I recommend a stepwise approach based on random forest (RF) or logistic regression (LOGR))
2. Select a threshold (I recommend 5% or 1%)
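
The data preparation steps above can be sketched as follows. This is a minimal illustration, not the platform's required method. It assumes a synthetic dataset with hypothetical column names (age, bmi, glucose, noise, response), encodes R/NR as 1/0, and reads the 5% threshold as a minimum random forest feature importance of 0.05. That reading is an assumption; the threshold could equally be applied as a p-value cutoff in a stepwise procedure.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an ML-ready dataset (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(55, 10, n),
    "bmi": rng.normal(28, 4, n),
    "glucose": rng.normal(100, 15, n),
    "noise": rng.normal(0, 1, n),
})
# Response label: 1 = responder (R), 0 = non-responder (NR).
df["response"] = (df["glucose"] + rng.normal(0, 5, n) > 100).astype(int)
# Blank out some values to mimic missing data.
df.loc[rng.choice(n, 20, replace=False), "bmi"] = np.nan

# Initial descriptive statistics and class balance.
print(df.describe())
print(df["response"].value_counts())

# Split before imputing so the test set does not influence the imputed values.
X = df.drop(columns="response")
y = df["response"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Median imputation: fit on the training split only, then apply to both splits.
imputer = SimpleImputer(strategy="median")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X.columns, index=X_test.index)

# Feature selection: keep features whose RF importance is at least 0.05 (assumed reading of "5%").
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0), threshold=0.05)
selector.fit(X_train_imp, y_train)
selected = list(X.columns[selector.get_support()])
print("Selected features:", selected)
```

For KNN imputation, scikit-learn's KNNImputer can be swapped in for SimpleImputer with the same fit/transform pattern.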

Modeling & Evaluation

Select 3 model algorithms for training and comparison (I recommend RF, gradient boosting (XGB), & LOGR)
1. Decision tree classifier (DTC) & support vector machine (SVM) algorithms are also fine
Record accuracy, AUROC, F1 score, precision, recall, and the confusion matrix for each model.
Generate a SHAP summary plot to visualize feature importance
OPTIONAL: conduct hyperparameter tuning using random search, grid search, cross-validation, etc.
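
The training and comparison loop above might look like the sketch below. It runs on synthetic data from make_classification, and to keep the example dependent on scikit-learn only, it uses GradientBoostingClassifier as a stand-in for XGBoost. In a real pipeline, xgboost's XGBClassifier would take its place. The SHAP summary plot and hyperparameter tuning are left out of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the prepared dataset.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then transform both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),  # stand-in for XGB
    "LOGR": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    prob = model.predict_proba(X_test_s)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "auroc": roc_auc_score(y_test, prob),
        "f1": f1_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }
    scalars = {k: round(float(v), 3) for k, v in results[name].items() if k != "confusion_matrix"}
    print(name, scalars)
    print(results[name]["confusion_matrix"])
```

Keeping the metrics in one dictionary per model makes it straightforward to build a comparison table with pandas afterwards.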

Single-Patient Predictions

Once trained, create a separate .py script to run predictions using the model of choice.
1. Save the fitted scaler, trained model, and feature names to pickle files using joblib, then load them into the new file.
Check for missing data and extra columns, and reorder columns as needed to match the training dataset.
Apply the saved scaler using .transform() only (not .fit_transform()).
Generate a class prediction and print the results clearly.
1. OPTIONAL: generate a probability score in addition to the class prediction
Validate the script by testing several single-patient predictions.
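
The single-patient workflow above can be sketched as shown below. For the sake of a self-contained example, the training side and the prediction side appear in one file, whereas in practice the prediction function would live in its own .py script. The file names (scaler.pkl, model.pkl, features.pkl), the feature names f0 to f3, the helper predict_patient, and the mapping 1 = R, 0 = NR are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# --- Training side: fit and save the scaler, model, and feature names ---
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names
X_arr, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X = pd.DataFrame(X_arr, columns=feature_names)
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

out_dir = tempfile.mkdtemp()  # temporary folder used here in place of a project folder
joblib.dump(scaler, os.path.join(out_dir, "scaler.pkl"))
joblib.dump(model, os.path.join(out_dir, "model.pkl"))
joblib.dump(feature_names, os.path.join(out_dir, "features.pkl"))


# --- Prediction side: load the saved objects and predict for one patient ---
def predict_patient(patient, artifact_dir):
    """Return (class label, probability of class 1) for a dict of patient values."""
    saved_scaler = joblib.load(os.path.join(artifact_dir, "scaler.pkl"))
    saved_model = joblib.load(os.path.join(artifact_dir, "model.pkl"))
    features = joblib.load(os.path.join(artifact_dir, "features.pkl"))

    row = pd.DataFrame([patient])
    absent = [col for col in features if col not in row.columns]
    if absent:
        raise ValueError(f"Missing columns: {absent}")
    row = row[features]  # drops extra columns and restores the training column order
    if row.isna().any().any():
        raise ValueError("Patient record contains missing values")

    scaled = saved_scaler.transform(row)  # transform only; never refit on new data
    label = int(saved_model.predict(scaled)[0])
    prob = float(saved_model.predict_proba(scaled)[0, 1])
    return label, prob


# Columns arrive out of order and with an extra field, as they might in practice.
patient = {"f2": 0.4, "extra_id": 123, "f0": -1.2, "f3": 0.9, "f1": 0.3}
label, prob = predict_patient(patient, out_dir)
print(f"Prediction: {'R' if label == 1 else 'NR'} (probability of R: {prob:.2f})")
```

Running predict_patient on several different records, including ones with missing or extra columns, is a simple way to validate the script.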

Documentation

Each step of your script should include a comment indicating the intended function.
Each model must be accompanied by a BioCompute Object.
