PredictMod Automated Pipeline

From HIVE Lab

Go Back to the PredictMod Home Page.

Overview

We have created an automated model training pipeline for researchers who have a significant collection of data but lack the programming expertise required to create and train predictive models. Such researchers can upload their data directly to this automated pipeline.

Upon upload of training data, the PredictMod platform performs several consecutive steps. The data are first inspected for general suitability for modelling, and typical errors such as missing values are flagged for the user to correct as appropriate.
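The inspection step can be illustrated with a minimal sketch. This is not PredictMod's own code: the function name `inspect_training_data`, the column names, and the specific checks are assumptions chosen for illustration, and pandas is assumed to be available.

```python
import io

import pandas as pd


def inspect_training_data(df: pd.DataFrame, response_column: str) -> list[str]:
    """Return human-readable warnings; an empty list means no issues were found.

    Hypothetical helper: the checks PredictMod actually runs are not documented here.
    """
    issues = []
    if response_column not in df.columns:
        issues.append(f"response column '{response_column}' not found")
    # Count missing values per column and report only the affected columns.
    missing = df.isna().sum()
    for column, count in missing[missing > 0].items():
        issues.append(f"column '{column}' has {count} missing value(s)")
    # A classifier needs at least two distinct labels to learn from.
    if response_column in df.columns and df[response_column].nunique(dropna=True) < 2:
        issues.append("response column has fewer than two distinct labels")
    return issues


# Small made-up data set with two missing measurements.
csv_text = """Sample,GeneA,GeneB,Status
S1,0.5,1.2,R
S2,,0.9,NR
S3,0.7,,R
"""
df = pd.read_csv(io.StringIO(csv_text))
issues = inspect_training_data(df, "Status")
for issue in issues:
    print(issue)
```

Returning a list of messages rather than raising an error mirrors the behaviour described above, where problems are flagged for the user to fix instead of stopping the upload outright.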

Next, once the data set is appropriately formatted and all such errors have been removed, the platform returns results from several ML and statistical algorithms. These results offer additional insight into the data and provide intervention outcome prediction models for use with future data points. Specifically, the user is given Principal Component Analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) clustering outputs as a means to explore the data for clear patterns or other such low-hanging fruit. The user is also provided trained models, with corresponding confusion matrices and Receiver Operating Characteristic (ROC) curve analyses, for a diverse family of ML models, including Random Forest, Decision Tree Classifiers, Support Vector Machines, Logistic Regression, and Boosting algorithms. The trained models are retained within the PredictMod platform for the user to inspect further, use with additional data points, download, or eventually publish to the broader PredictMod ecosystem.
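For readers curious what this step involves, the following is a rough sketch using scikit-learn. It assumes scikit-learn is installed, uses synthetic data in place of an uploaded data set, and picks its own hyperparameters and train/test split; none of these choices are confirmed to match what PredictMod does internally.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a cleaned upload (1 = responder, 0 = non-responder).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=0)

# Two-dimensional embeddings that can be plotted to look for visible structure.
pca_coords = PCA(n_components=2).fit_transform(X)
tsne_coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Hold out a quarter of the samples to evaluate each model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One representative per model family named in the text (assumed settings).
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(probability=True, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class, used to compute the area under the ROC curve.
    scores = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "confusion_matrix": confusion_matrix(y_test, model.predict(X_test)),
        "roc_auc": roc_auc_score(y_test, scores),
    }

for name, result in results.items():
    print(name, round(result["roc_auc"], 3))
    print(result["confusion_matrix"])
```

Evaluating on held-out samples keeps the confusion matrices and ROC scores from simply reflecting how well each model memorised its training data.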

For each resulting model, the user is also given links to a description of the underlying algorithm and to the original source references for it.

Usage Guide

  1. Log in to the PredictMod Platform
  2. Navigate to "More" -> "Automated Pipeline"
  3. Train a model
    1. Give the model a name; the resulting models are grouped under it.
    2. Specify the response column: the column in the training data that contains the ground-truth labels (often "R" and "NR"; the column is often named "Status" or "Response").
    3. List any columns to drop, such as sample names, that should not be used for training.
  4. Once trained, click on "New sample with training model"
    1. Select a family of models as needed
    2. Choose models to sample
    3. Upload the new data point
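
The roles of the response column, the dropped columns, and the new data point in the steps above can be sketched outside the platform as follows. The column names, values, and the choice of a Random Forest are made up for illustration, and pandas and scikit-learn are assumed to be available; the web interface handles all of this for the user.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training table: one identifier column, two features, one label column.
training = pd.DataFrame(
    {
        "SampleName": ["S1", "S2", "S3", "S4", "S5", "S6"],
        "GeneA": [0.90, 0.80, 0.85, 0.10, 0.20, 0.15],
        "GeneB": [0.20, 0.10, 0.30, 0.90, 0.80, 0.70],
        "Status": ["R", "R", "R", "NR", "NR", "NR"],
    }
)

response_column = "Status"        # ground-truth labels
columns_to_drop = ["SampleName"]  # identifiers carry no predictive signal

# Separate the labels from the features and remove the dropped columns.
labels = training[response_column]
features = training.drop(columns=[response_column, *columns_to_drop])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, labels)

# A new data point has the same columns as the training data, minus the response.
new_sample = pd.DataFrame({"SampleName": ["S7"], "GeneA": [0.88], "GeneB": [0.15]})
new_features = new_sample.drop(columns=columns_to_drop)[features.columns]
prediction = model.predict(new_features)[0]
print(prediction)
```

Reindexing the new sample with `features.columns` keeps its columns in the same order as the training data, which the model expects.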