Data Wrangling

How can we best use the endless stream of data to drive better health outcomes?

When the Seven Countries Study began its field work in 1957, each bit of data was gathered by hand, says School of Public Health (SPH) Professor Emeritus Henry Blackburn, who was director of the famous study into the impact of lifestyle and eating patterns on cardiovascular disease.

Today, in addition to traditional research methods, data are generated continuously, with no end point in sight, from sources such as cell phones, GPS tracking devices, social media, wearable monitors, electronic medical records, and smartphone purchasing apps, to name just a few.

“Data is everywhere and it’s getting more and more complex.”

 J. Sunil Rao

“In public health, we’re still gathering data in some cases the way they did in the Seven Countries Study, but we’re merging it with data that’s just ‘lying around’ from electronic medical records, social media, activity trackers, etc., waiting to be recognized as relevant,” says Professor Joe Koopmeiners, head of the Division of Biostatistics. “With data science, we’re expanding the scope of data gathering and analysis to answer the questions we’re asking today.”

Wheat from the chaff

As you’re reading this article, trillions of data points are being generated and stored, but as J. Sunil Rao says, “More data doesn’t necessarily mean more good data.” One of the major challenges for researchers and data scientists is separating the wheat from the chaff, letting the junk data blow away while saving the kernels that will move a study forward or lead to a vital discovery.

“Data is everywhere and it’s getting more and more complex,” says Rao, who will join the School of Public Health as a professor in biostatistics with a co-appointment as the Director of Biostatistics at the Masonic Cancer Center. “So, you have to develop new methods, you have to start thinking about how all this data is layered and about how to link data sources together to get a more complete picture of things.”

Rao was on the original research team that conducted the work for Cologuard, the in-home colon cancer screening test based on looking for particular DNA methylation patterns in stool samples. “You can use large reams of data to identify that specific little snippet of data that’s useful for early detection of things like colon cancer,” says Rao. “But it requires you to find a needle in a haystack.”

Adding to these challenges are biases that creep into data collection and algorithmic development, and the growing need to develop methods that can extract reasonable inferences from privacy-protected data.

Biased numbers 

It’s easy to think that data is data, straightforward and objective, but that’s not always the case. Data can carry problems from how it was collected and sorted, and when that data is used to build algorithms, the results can be biased along racial, gender, economic, and other lines. For example, if a study to predict heart disease risk enrolls only people who self-identify as white, its findings are biased and can keep other populations from getting appropriate treatments or interventions.

Solvejg Wastvedt, a biostatistics PhD student, is working in the realm of “algorithmic fairness,” a field described as the “intersection of machine learning and ethics.” Work in this area includes cleaning the input data itself by stripping away extraneous information, putting guardrails in place while training the algorithmic model so it cannot produce biased results, and going in after a model has been created to debias it and correct its predictions.

“My work is about measuring unfairness in clinical risk predictions,” says Wastvedt. “I’m working to make more nuanced measures of a model’s fairness, such as by accounting for multiple, intersecting forms of discrimination.”
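To make the idea of measuring unfairness concrete, here is a minimal sketch of one such check: comparing a risk model’s false negative rate across intersecting race-by-sex subgroups rather than across single attributes. This is an illustration, not Wastvedt’s actual method; the column names, subgroup labels, and 0.5 risk threshold are hypothetical assumptions.

```python
# Illustrative sketch only (not Wastvedt's method): compare a clinical risk
# model's false negative rate within every intersecting race-by-sex subgroup.
# Column names ("race", "sex", "outcome", "predicted_risk") and the 0.5
# threshold are hypothetical.
import pandas as pd

def false_negative_rate(group: pd.DataFrame, threshold: float = 0.5) -> float:
    """Share of patients who had the outcome but were predicted low-risk."""
    positives = group[group["outcome"] == 1]
    if len(positives) == 0:
        return float("nan")
    missed = positives[positives["predicted_risk"] < threshold]
    return len(missed) / len(positives)

def fnr_by_intersection(df: pd.DataFrame) -> pd.Series:
    """Compute the error rate within each race-by-sex subgroup, not for race
    or sex alone, so intersecting disparities become visible."""
    return df.groupby(["race", "sex"]).apply(false_negative_rate)

# Toy data to show the shape of the output
data = pd.DataFrame({
    "race": ["A", "A", "B", "B", "A", "B"],
    "sex":  ["F", "M", "F", "M", "F", "M"],
    "outcome": [1, 1, 1, 0, 0, 1],
    "predicted_risk": [0.7, 0.3, 0.2, 0.6, 0.1, 0.8],
})
print(fnr_by_intersection(data))
```

Large gaps between the subgroup error rates are one simple signal that a model treats some intersecting populations worse than others.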

“There are a lot of steps involved in preprocessing data before you can use it. With the commons, we can standardize and harmonize data for the whole University and that will result in reproducible research.”

Saonli Basu

Getting the most out of EHR

Assistant Professor Jue Hou is tackling a different problem when it comes to data: how to extract information from electronic health records (EHRs) while protecting privacy. The data in EHRs are extremely valuable for discerning, for example, which COVID treatments are most effective or, for personalized medicine, an individual’s risk for certain diseases.

“The ultimate goal is to combine information from various EHRs without sharing individual-level data,” says Hou. “We do this under the umbrella of ‘federated learning,’ an emerging trend on how to manage this big data from healthcare systems.”

Biostatistics Assistant Professor Jue Hou is using the emerging trend called “federated learning” to extract information from electronic health records while protecting privacy.

The way federated learning works in this context, explains Hou, is that healthcare systems form a consortium and a researcher sends an algorithm to each member of that consortium, asking them all to run it at the same time. The algorithm produces a summary of each system’s data without any individual information included, and the summaries from many systems can then be aggregated. Hou says that the federated learning model can cover many health systems in the U.S., and even globally.
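A highly simplified sketch can illustrate the mechanics described above. This is a toy example, not the consortium’s actual software: each health system runs the same local computation behind its own firewall and returns only aggregate counts, and the coordinating researcher pools those counts, so patient-level records never leave a site. The sites, variable names, and the pooled treatment-response estimate are all hypothetical.

```python
# Toy sketch of the federated idea (not Hou's actual algorithms): sites share
# only aggregate summaries, never individual patient rows.

def local_summary(records):
    """Run at a single health system: return counts only."""
    treated = [r for r in records if r["treated"]]
    return {"n_treated": len(treated),
            "n_responded": sum(r["responded"] for r in treated)}

def pooled_estimate(summaries):
    """Run by the coordinating researcher on the aggregated summaries."""
    n = sum(s["n_treated"] for s in summaries)
    k = sum(s["n_responded"] for s in summaries)
    return k / n if n else float("nan")

# Each consortium member computes its summary locally...
site_a = [{"treated": True, "responded": True},
          {"treated": True, "responded": False}]
site_b = [{"treated": True, "responded": True},
          {"treated": False, "responded": False}]
summaries = [local_summary(site_a), local_summary(site_b)]

# ...and only the summaries are shared and combined.
print(pooled_estimate(summaries))  # 2 of 3 treated patients responded
```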

“In the U.S., we have a very segmented healthcare system,” says Hou. “If we use information from just one system, it wouldn’t reflect the full picture and we might not have the number of patients we need to answer statistical questions. With federated learning, we also have better generalizability.”

Big plans for big data

Data is increasingly complex, and it increasingly crosses fields and disciplines. To wring the absolute most out of it, schools, colleges, divisions, and departments at the University are forming new alliances.

Biostatistics PhD student Solvejg Wastvedt works in the realm of “algorithmic fairness,” which guards against biased data being used to build algorithms.

Biostatistics Professor Saonli Basu, who develops methods for large-scale genomic data analysis, is leading a partnership with the Masonic Institute for the Developing Brain, the Minnesota Supercomputing Institute, and the University of Minnesota Informatics Institute to create the Genomic Data Commons. Across the University, researchers will contribute their multi-omics data to a single place where it will be cleaned, stored, and shared. It will make genomic research more effective and encourage collaborations and joint problem-solving. This initiative will also develop analytic pipelines to assist researchers with their genomic data analysis.

“There are a lot of steps involved in preprocessing data before you can use it,” says Basu. “If I’m analyzing the data, I’ll go through those steps using my own criteria, as will every other researcher who wants to use the data. Doing things that way introduces a lot of variability. With the commons, we can standardize and harmonize data for the whole University and that will result in reproducible research.”
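A small sketch suggests what that standardization might look like in practice. The filters and thresholds below are hypothetical illustrations, not the commons’ actual criteria; the point is that every analyst calls the same shared function, so the same quality-control decisions are applied to the data every time instead of each researcher choosing their own cutoffs.

```python
# Hypothetical illustration of shared preprocessing (not the Genomic Data
# Commons' actual pipeline): one function encodes the QC criteria that every
# analysis reuses.
import pandas as pd

def harmonize_genotypes(df: pd.DataFrame,
                        max_missing_rate: float = 0.05,
                        min_minor_allele_freq: float = 0.01) -> pd.DataFrame:
    """Apply one shared set of QC filters to a samples-by-variants genotype
    table coded 0/1/2 (copies of the minor allele), with NaN for missing calls."""
    missing_rate = df.isna().mean()        # per-variant missingness
    maf = df.mean(skipna=True) / 2         # per-variant minor allele frequency
    keep = (missing_rate <= max_missing_rate) & (maf >= min_minor_allele_freq)
    return df.loc[:, keep]
```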

Data, and innovative ways to gather, use, and share it, will play a central role in solving the public health challenges of the 21st century, and the School of Public Health is positioned to provide data science leadership for major collaborative initiatives locally, nationally, and globally.


You can’t change what you can’t measure

The challenge of researching the impact of structural racism on health is not a case of too much data to sort through; it’s a problem of not knowing how to use the data to measure structural racism accurately. Structural racism is, as one researcher described it, “the water, not the shark.” The harm it does can be deadly, but, like water, it is all around you.

To tackle this problem, researchers at the SPH Center for Antiracism Research for Health Equity developed the Multidimensional Measure of Structural Racism (MMSR) — multidimensional because it captures how various forms of structural racism reinforce one another to cause poor health among communities of color. The MMSR investigates structural racism through six quantifiable measures of inequity: residential, education, employment, income, wealth, and incarceration.

“Progress toward racial and health equities cannot be achieved if scholars and policymakers cannot measure the causes of the inequities, understand their deleterious effects, and track changes over time,” says research scientist Tongtan “Bert” Chantarat, who is leading the development of this measurement tool. “The MMSR helps researchers to do that.” Researchers plan to release the MMSR for public use in the near future.


Data science

In 2012, Harvard Business Review wrote that “there are no university programs offering degrees in data science.” That same year, it also called being a data scientist the “sexiest job of the 21st century.” A decade later, data science is still new to academia and, sexy or not, being a data scientist is certainly one of the most essential jobs today.

Data scientists clean, sort, design, understand, implement, and analyze the data essential for research, such as discovering what diseases lurk in your genes or, in a non-health context, who’s sharing their Netflix password. As a discipline, data science is critical for any organization or field that demands usable data. For public health and medicine, data science is essential for making the leaps and bounds people hope will help make health a human right.
The School of Public Health began its Public Health Data Science MPH program — one of only a handful of such programs in the U.S. — in fall 2021, and it is already surpassing its enrollment goals.
