It’s hard to overstate the impact machine learning will have on biomedicine. The ability to train computers to spot patterns by analyzing large, complex datasets is driving discoveries in heart disease, cancer, neurodegenerative diseases and more. For instance, the U.S. Department of Energy’s (DOE) Argonne National Laboratory has used machine learning to aid cancer research and accelerate COVID-19 antiviral discovery.
One of the main challenges is finding ways to share patient information between research organizations without violating privacy regulations. The Health Insurance Portability and Accountability Act (HIPAA) has strict rules on how organizations can share patient data. The PALISADE-X project is working to solve this issue in order to improve research outcomes.
Ravi Madduri is a computer scientist at Argonne and is the principal investigator of the PALISADE-X project. Along with colleagues from The University of Chicago, DOE’s Lawrence Livermore National Laboratory, Massachusetts General Hospital, and The Broad Institute, the team was able to create an effective and efficient means of enabling machine learning on sensitive patient data.
Security and privacy in biomedicine
Like many other areas of science, machine learning (ML) and artificial intelligence (AI) have taken the biomedical field by storm. Machine learning has been used mostly to predict individuals’ risk for diseases and to predict how drugs or other chemical compounds will work for different people.
While AI is mostly an umbrella term covering the entire pursuit of making machines mimic human intelligence, ML is about the specific process of teaching a machine to do something by training it to identify patterns.
“It takes a lot of training for an ML model to find patterns, and they need a lot of data to train on,” says Madduri.
The data used in biomedical research often is personally identifiable information (PII). This might include a patient’s gender, birth date, race and ethnicity, where they live, and specific details about their health. It’s sensitive data, and there are federal regulations governing how PII is used, stored, analyzed, and combined with other datasets.
“Patient data is held by hospital systems, public health agencies, government programs like the Veterans Health Administration and others,” says Madduri. “It’s private data, and health organizations usually don’t share it. We wanted to find a way to encourage sharing and training while protecting PII.”
PALISADE-X stands for Privacy-preserving Analysis and Learning in Secure and Distributed Enclaves and Exascale Systems. The project is funded by the DOE Office of Science’s Office of Advanced Scientific Computing Research with a goal of creating and experimenting with secure computing technologies and privacy enclaves – a way to isolate code and data from an operating system – to store and analyze PII data.
“We developed a framework that allows AI/ML models to be trained on biomedical datasets from multiple health organizations while preserving the privacy of PII data,” says Madduri. “It’s called the Argonne Privacy-Preserving Federated Learning framework. Right now, every time you want to create an AI model, you centrally collect all the data in the world, you label it, and you train the model. As data volumes become larger, this will no longer be practical.”
With federated learning, instead of sending data to models you send AI models to where the data is stored. That way, health organizations can keep their own data private by not having to send it anywhere, but the model still improves because it is trained on more datasets.
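The mechanics of that idea can be illustrated with federated averaging (FedAvg), a common federated learning scheme. The sketch below is hypothetical and not code from the PALISADE-X project: each "site" trains a simple linear model on its own private data, and only the resulting model weights, never the raw data, leave the site to be averaged into a new global model.

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=50):
    """One site's local update: a few steps of gradient descent on
    least-squares loss, starting from the shared global weights."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, sites):
    """One round: broadcast the model, collect each site's locally
    trained weights, and average them weighted by local sample count."""
    updates, counts = [], []
    for X, y in sites:  # each (X, y) stays at its own site
        updates.append(local_train(global_w, X, y))
        counts.append(len(y))
    return np.average(updates, axis=0, weights=np.array(counts, float))

# Two hypothetical "hospitals" with private data from the same relationship
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
sites = []
for n in (100, 60):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    sites.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, sites)
print(np.round(w, 2))  # approaches [2., -1.] without pooling any data
```

In a real deployment, the broadcast and collection steps happen over a network, and privacy-preserving variants add protections such as secure aggregation or noise on the transmitted weights, so that even the model updates leak as little as possible about individual patients.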
“It’s impossible to win the battle to make all the data available all the time, especially when the data is considered private and is identifiable,” says Madduri. “So privacy-preserving federated learning is really the silver bullet we need to make AI work well while keeping data private.”
Another issue is that AI models in biomedicine are typically static and unchanging, even when the distribution of the data changes. Madduri says the COVID-19 pandemic is a good example. The strains of a virus that circulate in the early stages of a pandemic differ from those in later stages, and so does how the disease manifests. If you are building an AI model to predict the risk posed to people in the early stages of a pandemic, that model won’t work well to predict the risk in later stages.
In computer science terms, what that means is that the underlying data distribution and the underlying risk profile are changing, but the AI models are not getting updated data to catch the latest trends in how the disease is evolving. This is sometimes called data drift. With PALISADE-X, one goal was to develop a framework that allows for quantifying the data drift in biomedicine using privacy-preserving federated learning.
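One simple way to quantify that kind of drift is to compare the distribution the model was trained on against newly arriving data. The sketch below is an illustration, not the PALISADE-X method: it uses the population stability index (PSI), with common rule-of-thumb thresholds (roughly 0.1 for minor and 0.25 for major drift) that are assumptions, not project-specific values.

```python
import numpy as np

def psi(expected, observed, bins=10):
    """Population stability index between two 1-D samples: bin both
    on the training data's histogram edges and compare proportions."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    o_pct = np.histogram(observed, bins=edges)[0] / len(observed)
    # floor the proportions to avoid log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    o_pct = np.clip(o_pct, 1e-6, None)
    return float(np.sum((o_pct - e_pct) * np.log(o_pct / e_pct)))

rng = np.random.default_rng(1)
train_era = rng.normal(loc=0.0, scale=1.0, size=5000)  # training-era feature
same_era  = rng.normal(loc=0.0, scale=1.0, size=5000)  # no drift
later_era = rng.normal(loc=0.8, scale=1.2, size=5000)  # shifted distribution

print(psi(train_era, same_era))   # small: distribution is stable
print(psi(train_era, later_era))  # large: drift suggests retraining
```

In a federated, privacy-preserving setting, each site would compute such summary statistics locally and share only the aggregate drift score, so the monitoring itself never exposes patient-level data.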
Bringing it to the real world
Bringing these solutions to real-world applications was aided by the large amount of data generated by the COVID pandemic.
“One of the initial projects that we started as an application for PALISADE-X was to predict COVID by training a model on chest X-rays,” says Madduri. “You can build an AI model that can predict the onset of COVID-19 when you train it on people that have COVID and on people who don’t have COVID, based on their X-ray data.”
First, the researchers created a baseline model that could predict COVID from chest X-rays using publicly available data. Then, they used the baseline model and conducted privacy-preserving federated training across patient X-ray data located at The University of Chicago and Mass General Biobank. Privacy-preserving federated learning allows multiple institutions to individually work on training a shared ML model while keeping their training data locally secured. Here, this allowed the organizations to create a COVID prediction model without having to share private patient data.
Another use case with Mass General Biobank trained an AI model to predict biological age using electrocardiogram (ECG) data. An ECG can be used to predict the age of a heart, which may differ from the person’s chronological age. When somebody smokes, for example, their heart and lungs age more rapidly.
In a traditional risk calculator, age plays a big role. The older a person is, the higher their risk for heart disease. However, these models don’t catch younger people whose hearts have aged prematurely due to lifestyle factors.
The PALISADE-X team used publicly available data from the PhysioNet project to create a baseline model, which was further trained using a federated learning approach with data from the UK Biobank and the Mass General Biobank. Researchers found that when they used the age predicted by the model instead of chronological age, the risk calculator performed much better at predicting the heart attack risk that young people face.
The future of privacy and usability
Madduri is excited at what the future holds for this technology.
“I’ve been working in developing and applying computing solutions to biomedical problems for the past 16 years,” says Madduri. “And most of the time, progress is hampered because it takes a long time to get the data. The data compliance issues prevent progress. What motivates me is the opportunity to work with this system and create robust AI models that can save people’s lives.”
Madduri says one of the strengths of the PALISADE-X project is the team of scientists across multiple organizations with different capabilities. That includes Argonne computational mathematicians Kibaek Kim and Minseok Ryu, Kyle Halliday from Lawrence Livermore National Laboratory, Maryellen Giger of The University of Chicago, and Pradeep Natarajan of the Broad Institute.
Additionally, the project has several ongoing international collaborations with researchers from Norway and Japan, leveraging tools developed in the PALISADE-X project to build trustworthy and robust AI models.
Madduri and the team are working on applying technologies developed in the PALISADE-X project, notably the Argonne Privacy Preserving Federated Learning toolkit, to help achieve the objectives of the National Institutes of Health Bridge2AI program. The overall goal of the Bridge2AI program is to generate new “flagship” data sets and best practices for machine learning analysis.