Augmenting real data with synthetic data

From HIVE Lab



Latest revision as of 18:43, 28 August 2025


In biomedical research, small sample sizes often make it difficult to develop robust machine learning models and to evaluate computational scalability. To address this limitation, we designed an algorithm that uses conditional Generative Adversarial Networks (cGANs) to generate synthetic data, effectively expanding the available datasets. Although synthetic data may not always improve model accuracy, it lets researchers assess computational efficiency, such as training and inference times, on larger datasets. It also enables augmentation of underrepresented classes in binary classification tasks, which helps balance datasets and improve model evaluation, much as SMOTE (Synthetic Minority Over-sampling Technique) does. As real-world data becomes available, users can compare the synthetic data against it to evaluate the synthetic data's utility.

Various algorithms have been used to generate synthetic biomedical data and address sample-size limitations in machine learning applications. For example, one study generated synthetic populations from matched case-control data (n = 180 pairs) using multivariate kernel density estimation (KDE) with constrained bandwidth matrices, effectively expanding the dataset for more reliable modeling. More recently, generative AI techniques such as Emergent Self-Organizing Map (ESOM)-based augmentation have been proposed; this approach concentrates new points around existing data, enhancing their utility for downstream analysis. ESOM-based augmentation appears to be a viable alternative for producing synthetic data and can capture both global and local features, but it can have higher time complexity and computational cost than certain GAN architectures.
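The KDE idea mentioned above can be illustrated with a minimal sketch. It is not the cited study's method: it assumes a diagonal bandwidth chosen per feature by Scott's rule rather than constrained bandwidth matrices, and the function name `kde_sample` is hypothetical. Sampling from a Gaussian KDE amounts to picking a real row at random and adding Gaussian noise scaled by the bandwidth.

```python
import random
import statistics


def kde_sample(data, n_samples, seed=0):
    """Draw synthetic rows from a Gaussian KDE fitted to `data`.

    data: list of equal-length numeric rows (at least two rows).
    Assumption: diagonal bandwidth from Scott's rule, not the constrained
    bandwidth matrices used in the study cited in the text.
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    # Scott's rule: h_j = sigma_j * n^(-1 / (d + 4))
    factor = n ** (-1.0 / (d + 4))
    bandwidths = [
        statistics.stdev([row[j] for row in data]) * factor for j in range(d)
    ]
    samples = []
    for _ in range(n_samples):
        # Each KDE component is centred on one observed row.
        center = rng.choice(data)
        samples.append([rng.gauss(center[j], bandwidths[j]) for j in range(d)])
    return samples


if __name__ == "__main__":
    real = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
    for row in kde_sample(real, 3):
        print(row)
```

Because every synthetic row is a perturbed copy of a real row, new points stay close to the observed data; a wider bandwidth trades realism for diversity.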

Our GAN-based method produces more realistic data than other generative approaches by leveraging feature learning, in which the generator iteratively refines its output in response to feedback from the discriminator. The cGAN uses patient data tags, such as responder/non-responder status, as conditional inputs when synthesizing new data. The original GAN architecture does not use tags or conditions, which reduces its ability to accurately represent the original data. Other generative models, such as Variational Autoencoders (VAEs), may be more useful for capturing outliers and noise, but they can suffer from lower sample sharpness and may fail to capture complex data. GANs implicitly learn complex data distributions through the competitive optimization between the generator and the discriminator. This allows them to capture high-dimensional dependencies and generate more realistic outputs without requiring explicit density estimation. However, GANs can suffer from mode collapse, in which the generator produces a limited set of highly similar outputs instead of capturing the full range of the target data distribution. Despite advances in GAN-generated medical data, validation metrics such as the Wasserstein metric or negative log-likelihood cannot guarantee that the generated data will retain the patient-level or cohort-wide characteristics essential for building predictive outcome models.
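To make the role of the conditional input concrete, the sketch below shows one common way a tag can be attached to the inputs of a cGAN: the label is one-hot encoded and concatenated to the generator's noise vector and to the discriminator's feature vector. The page does not describe the actual network architecture, so the label names, the noise dimension, and the helper functions here are all hypothetical, and no network or training loop is shown.

```python
import random

# Hypothetical tag names; the text only says "responder/non-responder status".
LABELS = ("non-responder", "responder")


def one_hot(label, labels=LABELS):
    """Encode a tag as a one-hot vector over the known labels."""
    if label not in labels:
        raise ValueError(f"unknown label: {label!r}")
    return [1.0 if label == name else 0.0 for name in labels]


def generator_input(label, noise_dim=8, rng=random):
    """Noise vector with the condition appended: what a conditional generator sees."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(label)


def discriminator_input(features, label):
    """Feature vector (real or synthetic) with the same condition appended."""
    return list(features) + one_hot(label)


if __name__ == "__main__":
    z = generator_input("responder", noise_dim=4, rng=random.Random(0))
    print(len(z), z[-2:])
    print(discriminator_input([0.5, 1.5], "non-responder"))
```

An unconditional GAN would pass only the noise vector to the generator, so it has no way to request a sample from a specific class; appending the tag is what allows class-targeted augmentation.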

For a robust validation approach, five metrics were used to compare the synthetic data with the original data: Mean Difference, Standard Deviation Difference, Pairwise Euclidean Distance, Feature Correlation, and Mean Kolmogorov-Smirnov (KS) Test P-Value. These metrics were computed for three methods: SMOTE, GAN, and cGAN. The cGAN model preserved statistical relationships better than the unconditional GAN and maintained greater sample diversity than SMOTE without losing too much realism.
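The five metrics are not defined precisely on this page, so the following is only one plausible reading, written as a sketch with the standard library. Assumptions: differences are averaged over features as absolute values; the pairwise distance is the mean Euclidean distance over all real-synthetic row pairs; feature correlation is the mean absolute difference between the Pearson correlations of each feature pair; and the KS p-value uses the standard asymptotic series with a small-sample correction, which may differ slightly from the exact p-value a statistics package reports. All function names are hypothetical.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic two-sided p-value for KS statistic d with sample sizes n and m."""
    root = math.sqrt(n * m / (n + m))
    lam = (root + 0.12 + 0.11 / root) * d
    if lam < 0.3:  # the series converges slowly here and the p-value is ~1
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * series))


def compare(real, synth):
    """Return the five comparison metrics for two lists of numeric rows."""
    d = len(real[0])
    cols_r = [[row[j] for row in real] for j in range(d)]
    cols_s = [[row[j] for row in synth] for j in range(d)]
    pairs = [(a, b) for a in range(d) for b in range(a + 1, d)]
    corr_gaps = [
        abs(pearson(cols_r[a], cols_r[b]) - pearson(cols_s[a], cols_s[b]))
        for a, b in pairs
    ]
    return {
        "mean_diff": statistics.fmean(
            [abs(statistics.fmean(r) - statistics.fmean(s)) for r, s in zip(cols_r, cols_s)]
        ),
        "std_diff": statistics.fmean(
            [abs(statistics.stdev(r) - statistics.stdev(s)) for r, s in zip(cols_r, cols_s)]
        ),
        "pairwise_distance": statistics.fmean(
            [math.dist(r, s) for r in real for s in synth]
        ),
        "corr_diff": statistics.fmean(corr_gaps) if corr_gaps else 0.0,
        "mean_ks_p": statistics.fmean(
            [ks_pvalue(ks_statistic(r, s), len(r), len(s)) for r, s in zip(cols_r, cols_s)]
        ),
    }


if __name__ == "__main__":
    real = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
    synth = [[1.1, 2.2], [2.2, 3.9], [2.9, 6.1], [4.1, 8.0]]
    for name, value in compare(real, synth).items():
        print(f"{name}: {value:.4f}")
```

Under this reading, smaller mean, standard deviation, and correlation differences and a larger mean KS p-value indicate synthetic data that is statistically closer to the original, while the pairwise distance gives a rough sense of how far synthetic rows sit from real ones.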