Modeling Tutorials
Latest revision as of 16:40, 24 February 2026
Go Back to PredictMod Project.
Step-by-Step Tutorials
The following tutorials are recommended for those interested in creating and submitting models to the PredictMod platform.
Recommended Pipeline Steps
Please use the existing tutorials on the wiki and supplement your pipeline with the information below. Please indicate which of this content, if any, is more useful than what is online so that we can revise accordingly.
Environment Setup
- Install VS Code, Miniconda, and Python (version 3.11 is recommended).
  - Miniconda is great for creating isolated environments. Learn more at https://www.anaconda.com/docs/getting-started/working-with-conda/conda-intro-tutorial.
- Make sure the relevant packages are installed:
  - scikit-learn, matplotlib, numpy, pandas, shap, xgboost, seaborn, and joblib.
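The environment setup above might look like the following in a terminal (a minimal sketch; the environment name "predictmod" is arbitrary, and conda is assumed to be on your PATH):

```shell
# Create and activate an isolated conda environment with Python 3.11
conda create -n predictmod python=3.11
conda activate predictmod

# Install the packages listed above
pip install scikit-learn matplotlib numpy pandas shap xgboost seaborn joblib
```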
Data Preparation
- Import the dataset. The ML-ready dataset should:
  - Include all required variables for model training
  - Include a clear responder/non-responder (R/NR) outcome column
- Run initial descriptive statistics
- Split the data into train/test sets
- Handle missing data (if needed)
  - Simple imputation (median/mode) should be sufficient for first-time modelers; KNN imputation is recommended but more complex.
- Conduct feature selection
  - Select a method (stepwise selection using random forest (RF) or logistic regression (LOGR) is recommended)
  - Select a threshold (a maximum of 10 or 15 features, or a percentage threshold of 5% or 1%, is recommended)
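The data-preparation steps above can be sketched as follows. This is a minimal illustration using synthetic data in place of a real dataset (the `feat_*` column names and all parameter values are placeholders), with median imputation and random-forest-based selection of the top 10 features:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for an ML-ready dataset: 20 numeric features plus an
# R/NR outcome (a real pipeline would load a CSV with pd.read_csv).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 20)),
                 columns=[f"feat_{i}" for i in range(20)])
y = (X["feat_0"] + rng.normal(scale=0.5, size=200) > 0).astype(int)  # 1 = R, 0 = NR

# Initial descriptive statistics
print(X.describe().loc[["mean", "std"]].round(2))

# Train/test split (stratified so both classes appear in each split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Simple median imputation for any missing values
imputer = SimpleImputer(strategy="median")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X.columns)

# Feature selection: keep the 10 most important features under a random forest
# (threshold=-inf means features are ranked purely by importance)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    max_features=10, threshold=-np.inf)
selector.fit(X_train_imp, y_train)
selected = X.columns[selector.get_support()].tolist()
print("Selected features:", selected)
```

Note that the imputer is fit on the training split only and then applied to the test split, which avoids leaking test-set information into preprocessing.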
Modeling & Evaluation
- Select at least 3 model algorithms for training and comparison (RF, gradient boosting (XGB), & LOGR are recommended)
  - Decision tree classifier (DTC) & support vector machine (SVM) algorithms are also recommended.
  - Each model algorithm is thoroughly documented at https://scikit-learn.org/stable/api/sklearn.ensemble.html.
- Record accuracy, AUROC, F1, precision, recall, and a confusion matrix for each model.
- Generate a SHAP summary plot to visualize feature importance
- OPTIONAL: conduct hyperparameter tuning using random search, grid search, cross-validation, etc.
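A minimal sketch of the model-comparison loop above, on synthetic data. Scikit-learn's GradientBoostingClassifier stands in for XGBoost here so the example has no dependency outside scikit-learn; swap in xgboost.XGBClassifier for the real pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             precision_score, recall_score, confusion_matrix)

# Synthetic binary-classification data stands in for the prepared dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Three candidate algorithms, as recommended above
models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "GB": GradientBoostingClassifier(random_state=42),
    "LOGR": LogisticRegression(max_iter=1000),
}

# Record the full set of metrics for each model
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probabilities for AUROC
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "auroc": roc_auc_score(y_test, y_prob),
        "f1": f1_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()
                 if k != "confusion_matrix"})
```

For the SHAP summary plot, `shap.TreeExplainer` on the fitted RF followed by `shap.summary_plot` is the usual route for tree-based models.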
Single-Patient Predictions
- Once the model is trained, create a separate .py script to run predictions using the model of choice.
  - Save the fitted scaler, trained model, and feature names to pickle files using joblib, then load them into the new script.
- Check for missing data and extra columns, and reorder columns as needed to match the training dataset.
- Apply the saved scaler using .transform() (never re-fit it on new data).
- Generate a class prediction and print the results clearly.
  - OPTIONAL: generate a probability score in addition to the class prediction.
- Validate the script by testing several single-patient predictions.
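The single-patient workflow above can be sketched end-to-end as follows. This is an illustration only: the feature names (`age`, `bmi`, `glucose`), file names, and training data are all hypothetical, and both halves are shown in one file here, whereas in practice the prediction half would live in its own .py script:

```python
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# --- Training side: fit a scaler and model, then persist all three artifacts
feature_names = ["age", "bmi", "glucose"]  # hypothetical feature names
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(100, 3)), columns=feature_names)
y_train = (X_train["glucose"] > 0).astype(int)  # 1 = R, 0 = NR

scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
dump(scaler, "scaler.pkl")
dump(model, "model.pkl")
dump(feature_names, "features.pkl")

# --- Prediction side (separate .py script in practice): load the artifacts,
# align the patient's columns to the training order, scale, and predict.
scaler = load("scaler.pkl")
model = load("model.pkl")
feature_names = load("features.pkl")

patient = pd.DataFrame([{"glucose": 1.2, "age": 0.3, "bmi": -0.1,
                         "extra_col": 99}])  # extra column, wrong order
patient = patient[feature_names]             # drop extras, reorder to match
assert not patient.isna().any().any(), "missing data needs imputation first"

X_scaled = scaler.transform(patient)         # .transform(), never .fit()
pred = model.predict(X_scaled)[0]
prob = model.predict_proba(X_scaled)[0, 1]   # optional probability score
print(f"Prediction: {'R' if pred == 1 else 'NR'} (P(R) = {prob:.2f})")
```

Validating with several such single-patient rows, including ones with shuffled or extra columns, confirms the column-alignment step works before submission.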
Documentation
- Each step of your script should include a comment indicating the intended function.
- Each model must be accompanied by a BioCompute Object.