HIVE De-identification Tool
Revision as of 13:47, 20 March 2025
The High-performance Integrated Virtual Environment (HIVE) de-identification tool is used to de-identify all GW-FEAST data sources prior to harmonization. A process overview is detailed below (authored by Dr. Robel Kahsay, Mazumder Lab, Department of Biochemistry and Molecular Medicine, School of Medicine and Health Sciences, The George Washington University).
HIVE De-identification Tool Diagram
[Figure: Nbcc deidn tool v1.0.png]
De-identification of Tabular Data: A Process Overview
Introduction
The need for de-identifying sensitive data, particularly in the biomedical field, has become critical as data privacy regulations such as HIPAA and GDPR have placed stringent requirements on how personally identifiable information (PII) is handled. In many cases, research relies on the use of patient-level data, but the direct use of such data can compromise privacy if not properly de-identified. To address this, we developed a comprehensive de-identification tool designed specifically for tabular data, such as VCF, TSV, and TXT files, allowing researchers to securely manage data while maintaining compliance with privacy regulations. This document provides an overview of the de-identification process facilitated by the tool, describing the key steps and mechanisms that ensure data privacy is maintained while enabling research.
The De-identification Process
The de-identification of data involves transforming sensitive or identifiable information so that individuals cannot be re-identified without proper credentials. This process must strike a balance between removing identifiable information and preserving the utility of the data for research purposes. The de-identification tool is designed to be flexible, allowing customization based on the data type and specific fields in the dataset.
Key Steps in the De-identification Process
Configuration of De-identification Rules
The first step involves defining which fields in the dataset should be de-identified. This is achieved through a configuration file (anon.json), where each field in the data is classified as either 'anonymize' or 'keep.' Fields that contain sensitive information, such as names, email addresses, or other PII, are marked for de-identification. Non-sensitive fields, such as metadata or research-specific identifiers, can be kept as they are if they do not risk re-identification.
The anon.json file is reviewed, approved, and signed by the individual(s) authorized to do so.
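As a minimal sketch of what such a configuration might look like and how it could be checked before use: the source names only the file (anon.json) and the two labels ('anonymize' and 'keep'). The field names, the flat field-to-action layout, and the helper `load_rules` below are assumptions for illustration, not the tool's actual schema.

```python
import json

# Hypothetical anon.json content: a flat map of field name -> action.
# The real schema used by the HIVE tool may differ.
EXAMPLE_CONFIG = """
{
  "patient_name": "anonymize",
  "email": "anonymize",
  "sample_id": "keep",
  "collection_site": "keep"
}
"""

ALLOWED_ACTIONS = {"anonymize", "keep"}


def load_rules(config_text):
    """Parse the configuration and reject any field with an unknown action."""
    rules = json.loads(config_text)
    for field, action in rules.items():
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"Field {field!r} has unsupported action {action!r}")
    return rules


rules = load_rules(EXAMPLE_CONFIG)
fields_to_anonymize = sorted(f for f, a in rules.items() if a == "anonymize")
print(fields_to_anonymize)
```

Rejecting unknown actions up front means a typo in the configuration stops the run rather than silently leaving a sensitive field untouched.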
Token-Based Encryption
During the de-identification process, certain fields may be pseudonymized or masked using token-based encryption. A secure token is generated and stored privately, ensuring that the original values cannot be recovered without access to the encryption key. This token serves as a mapping between the original data and the de-identified data, but it is stored separately from the de-identified files to prevent unauthorized re-identification.
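The source does not specify which algorithm the tool uses. One common way to derive stable pseudonyms from a secret token is keyed hashing (HMAC-SHA256), sketched below as an illustration only. The function names `generate_token` and `pseudonymize` and the 16-character pseudonym length are assumptions. Note that HMAC is one-way, so in this sketch re-identification would rely on a separately stored mapping rather than on decryption.

```python
import hashlib
import hmac
import secrets


def generate_token(num_bytes=32):
    """Create a random secret; it must be stored apart from the output files."""
    return secrets.token_bytes(num_bytes)


def pseudonymize(value, token, length=16):
    """Map a value to a stable pseudonym using HMAC-SHA256 keyed by the token."""
    digest = hmac.new(token, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


token = generate_token()
alias_a = pseudonymize("Jane Doe", token)
alias_b = pseudonymize("Jane Doe", token)
alias_other_key = pseudonymize("Jane Doe", generate_token())

# The same value and token always give the same pseudonym;
# a different token gives an unrelated one.
print(alias_a == alias_b, alias_a == alias_other_key)
```

Because the pseudonym depends on the token, the same original value maps to the same replacement across files processed with that token, which keeps records linkable without exposing the original value.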
De-identification and Data Processing
Once the configuration is set, the tool processes each file in the input directory, replacing the values in fields marked for de-identification with de-identified or pseudonymized equivalents. The processed data is saved in a separate output directory, ensuring the original and de-identified files are kept distinct.
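The processing step can be sketched roughly as follows for tab-separated files. This is an assumption-laden illustration, not the tool's implementation: the function names, the `*.tsv` file pattern, the sample field names, and the `replace` callback (which stands in for whatever pseudonymization the tool applies) are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def deidentify_file(src, dst, rules, replace):
    """Copy a TSV file, replacing values in fields marked 'anonymize'."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(
            fout, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            for field, action in rules.items():
                # Empty or missing values are left as they are.
                if action == "anonymize" and row.get(field):
                    row[field] = replace(row[field])
            writer.writerow(row)


def deidentify_directory(input_dir, output_dir, rules, replace):
    """Process every .tsv file in input_dir into a separate output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(input_dir).glob("*.tsv")):
        dst = out / src.name
        deidentify_file(src, dst, rules, replace)
        written.append(dst)
    return written


# Demonstration on a throwaway directory with one small file.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "input"
    in_dir.mkdir()
    (in_dir / "cohort.tsv").write_text("name\tsample_id\nJane Doe\tS1\n")
    demo_rules = {"name": "anonymize", "sample_id": "keep"}
    outputs = deidentify_directory(
        in_dir, Path(tmp) / "output", demo_rules, lambda value: "REDACTED"
    )
    result = outputs[0].read_text()

print(result)
```

Writing to a separate output directory, rather than editing files in place, keeps the originals untouched and makes it harder to confuse raw and de-identified copies.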
Secure Storage and Management of De-identified Data
A key part of the de-identification process is the secure handling of the dictionary mapping used during de-identification. This mapping, which tracks the relationship between original and de-identified values, is stored in a private JSON file that must remain inaccessible to unauthorized users. This file serves as a reference for internal use if re-identification is ever required (in controlled environments), but it is never exposed outside the secured environment.
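One way to keep such a mapping file private on a POSIX system is to create it with owner-only permissions, as in the sketch below. The file name `mapping.json`, the sample values, and the helper names are hypothetical. File permissions are only one layer of protection, and the tool's actual storage mechanism is not described in the source.

```python
import json
import os
import stat
import tempfile


def save_private_mapping(mapping, path):
    """Write the original -> de-identified mapping readable only by the owner."""
    # O_EXCL refuses to overwrite an existing file; 0o600 = owner read/write only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump(mapping, handle, indent=2)


def load_private_mapping(path):
    """Read the mapping back for controlled, internal re-identification."""
    with open(path) as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    mapping_path = os.path.join(tmp, "mapping.json")
    mapping = {"Jane Doe": "a1b2c3d4", "John Roe": "e5f6a7b8"}
    save_private_mapping(mapping, mapping_path)
    mode = stat.S_IMODE(os.stat(mapping_path).st_mode)
    restored = load_private_mapping(mapping_path)

# Invert the mapping to look up an original value from its pseudonym.
reverse = {alias: original for original, alias in restored.items()}
print(oct(mode), reverse["a1b2c3d4"])
```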
Process Flow Summary
At a high level, the process of de-identifying data with this tool can be summarized as follows:
1. Data Ingestion: Tabular data is imported from various formats (e.g., VCF, TSV, TXT).
2. Configuration: A de-identification configuration file specifies which fields in the data will be de-identified.
3. Token Generation: An encryption token is generated and used to secure the de-identification process.
4. De-identification Execution: The tool replaces sensitive fields with de-identified values and stores the result in a separate output directory.
5. Secure Output: De-identified files are securely saved, while mappings and tokens are stored in protected locations to ensure privacy.
Conclusion
The de-identification tool offers a secure and customizable way to anonymize sensitive fields in tabular data while preserving the integrity of non-sensitive information. By focusing on configurable rules and secure handling of data, this tool ensures that the balance between data privacy and research utility is maintained. The process not only anonymizes data but also provides secure mapping for internal reference in controlled settings, adhering to privacy regulations such as HIPAA and GDPR. This makes the tool highly applicable for sensitive data management in medical research and other fields where privacy is paramount.