New Diabetes Study Shows How Big Data Might Drive Precision Medicine

Disease descriptions reflect contemporary understanding.  “Dropsy,” for instance, was used for hundreds of years to characterize patients (including Beethoven) suffering from excess fluid accumulation; the word itself derives from the Greek hydor, meaning water.

As our understanding of human physiology deepened, “dropsy” was eventually recognized as a symptom – now called edema – that can be associated with a range of different underlying conditions, including heart failure and kidney disease – more precise diagnoses that often lead to distinct treatment strategies.

(Disclosure/reminder: I am Chief Medical Officer of DNAnexus, a cloud data management company focused on genomics and other health-related information.)

Subdividing a complex condition into subtypes based on underlying cause represents an important goal of medical research, and is a key ambition of precision medicine.  "The use of health data to better define and proscribe the boundaries of different diseases will only help understand them better and define more focused treatments for them," explains Russ Altman*, a physician-scientist and informaticist at Stanford.

We’ve already seen initial glimpses into what success might look like.  The improved characterization of certain types of cancers (such as some forms of leukemia) has led to targeted therapies (such as imatinib [Gleevec]) that have significantly improved the lives of patients.  Similarly, understanding the distinct deficits associated with different cystic fibrosis mutations has enabled the development of targeted therapies such as ivacaftor (Kalydeco), a medicine that has profoundly impacted the lives of a small subset of CF patients, including Bill Elder, Jr.

As exciting and important as these successes have been, progress in many other conditions has been maddeningly slow.  The theory has been that more precise patient characterization – aided by emerging technologies like next-generation sequencing and wearable sensors – should lead to the identification of clinically important subgroups, and ultimately to treatment approaches that are more targeted and more effective.  Yet there are strikingly few examples of this translating into practice – a troubling gap I recently discussed here.

Type 2 diabetes – which afflicts an estimated 28 million Americans – has long seemed like the perfect example of a disease that should lend itself to informative subtyping.  As my former colleague Denny Ausiello put it (as captured by Vinod Khosla), “the term ‘diabetes’ as a disease will disappear in the next decade or two just as the term Dropsy has disappeared…. [T]here are a dozen very distinct diseases that all have a common symptom in ‘poor blood sugar control’ but in fact need very different treatment and management.”

Diabetes doctors “are aware that type 2 diabetes is a heterogeneous condition,” adds Dr. Jose Florez, Chief of the Diabetes Unit at MGH (disclosure: I trained there), “and that you can reach the diagnosis of hyperglycemia via several different pathogenic mechanisms.”

In this context, the recent publication in Science Translational Medicine of an approach to divide type 2 diabetes into three subtypes is of particular interest.

The researchers – led by Joel Dudley at the Icahn School of Medicine at Mount Sinai – leveraged a biobank they had built, consisting of genetic and EMR data from over 10,000 consented patients.  They focused on the roughly 2,500 biobank patients who had type 2 diabetes, and asked a computer to group these patients by similarity across clinical characteristics found in the medical record.
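
To make the clustering step concrete, here is a minimal sketch in Python of how patients might be grouped from EMR-derived features. It is illustrative only: the study relied on commercial software (discussed below) rather than the k-means approach shown here, and the feature matrix is a made-up stand-in for real clinical data.

```python
# Illustrative only: the study's actual pipeline used commercial software and a
# richer feature set; the random features and choice of k-means are assumptions.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical EMR-derived features for ~2,500 patients with type 2 diabetes,
# one row per patient (e.g., labs, vitals, comorbidity flags).
rng = np.random.default_rng(0)
emr_features = rng.normal(size=(2500, 20))  # stand-in for real clinical data

# Standardize features so no single lab value dominates the distance metric.
scaled = StandardScaler().fit_transform(emr_features)

# Group patients into three clusters based on clinical similarity.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
subtype_labels = kmeans.fit_predict(scaled)

# Each patient now carries a putative subtype label (0, 1, or 2) that can be
# tested for enrichment of complications, comorbidities, or genetic variants.
print(np.bincount(subtype_labels))
```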

What emerged were three distinct subtypes – one enriched for microvascular complications such as diabetic nephropathy and diabetic neuropathy, one enriched for cancer and certain cardiovascular diseases, and one enriched for neurological disease, allergies, HIV infection, as well as other cardiovascular diseases.   Each subtype was also associated with specific genetic variants – variants that were enriched in one subtype compared to the others – and in some cases, particular molecular pathways.
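
The enrichment idea itself is simple to illustrate: for each variant, ask whether carriers are over-represented in one subtype relative to the others. Below is a sketch using Fisher's exact test on invented counts; it is not the paper's actual statistical procedure, just the general shape of such a test.

```python
# Illustrative sketch of a variant-enrichment test for one subtype versus the rest.
# The counts below are invented; the paper's own statistical methodology differed.
from scipy.stats import fisher_exact

# 2x2 contingency table:
#                  carriers   non-carriers
# subtype 1           120          700
# other subtypes      150         1530
table = [[120, 700],
         [150, 1530]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, one-sided p = {p_value:.3g}")
```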

"In a funny sense, I’m surprised they only found a small number of T2DM clusters," commented Altman* of Stanford, adding,

"I can imagine that [type 2 diabetes] arose in multiple indigenous populations over time, and so I would not be surprised if it is a cluster of 30 or 50 or 100 phenotypically similar molecular 'disorders.' I use quotes because there were probably very adaptive reasons for the development of these molecular differences."

“The fact that the genetic enrichments seem to corroborate many of the clinical associations is what got me most excited about this paper,” senior author Joel Dudley told me.  (Disclosure: I’ve no personal or professional association with Dudley or his research group; our email exchange around this paper was our first contact.)  While the genetic information wasn’t used to generate the original three clusters, Dudley notes,

“Many of the genetic findings seem directly relatable to the clinical differences among the groups.  For example, the group with increased cardiovascular risk had unique genetic features enriching various cardiac and arterial pathways.”

Of course, the ability to construct a “plausible” story from genes that come out of a complex analysis (more on that in a minute) doesn’t constitute validation of the approach, and the early history of gene expression arrays is littered with examples of promising genes, compelling narratives, and no replication.

Making interpretation more difficult, some experts found the methods inscrutable. A leading academic researcher in this area told me, “to be honest, I’m not sure what to make of this paper,” in part because the analysis used “commercial black-box software that is very poorly described in the paper,” and in part because the genetic analysis was done in a fashion that led this scientist to say, “I honestly can’t figure out how to interpret the P-values coming from this, and again the methods are poorly described.”

I asked Dudley about one aspect of this – the challenge of multiple comparisons (a topic I’ve written about here). Dudley replied that the paper was evaluated by “informatics and statistics peer reviewers, and believe me they put us through the wringer to deal with this issue and thankfully we came up with a solution” that satisfied them.
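
One standard remedy for the multiple-comparisons problem, offered here purely as an illustration and not as the authors' actual solution, is to control the false discovery rate across the many variant-by-subtype tests, for example with the Benjamini-Hochberg procedure:

```python
# A common multiple-comparisons remedy: Benjamini-Hochberg FDR control across
# many enrichment tests. This is a generic illustration, not the paper's method.
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    thresholds = alpha * (np.arange(1, m + 1) / m)
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        rejected[order[:cutoff + 1]] = True
    return rejected

# Example: p-values from several hypothetical variant-enrichment tests.
pvals = [0.0004, 0.003, 0.02, 0.04, 0.21, 0.5, 0.77]
print(benjamini_hochberg(pvals))
```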

As Florez of MGH nicely summarized for me, there are three main challenges with this type of study:

  • Reproducibility: can these findings be repeated with other data sets?
  • Interpretability: what physiological insight is afforded by these findings?
  • Clinical utility: what does this mean to the diabetologist treating the patient in front of her?

Dudley, it turns out, agrees completely.  He’s looking for a replication data set (the challenge, I imagine, is that there are relatively few available); he anticipates the need to “dive into the biology of the genetic factors revealed by this study”; and he’s keen to “perform a prospective evaluation” of the markers to see if they might have prognostic clinical value for doctors and patients.

Implications

The appeal of the approach used by the authors is the extent to which it seems so scalable, and so improvable by the inevitable addition of more, richer, and cleaner data.  As Dudley told me, “The EHR data is quite sparse and noisy, so I am honestly shocked we were able to get anything meaningful out of it.”

Imagine datasets that included richer genomic data (such as whole exome or whole genome sequencing), more complete EHR data, and more data from wearables and other devices that capture what I’ve called “dynamic phenotype.”

To this point, it’s largely been an article of faith that such rich integrated datasets would be useful and clinically important.  The hope this paper represents is that perhaps we’re finally closing in on a tipping point, or a coalescence point, when we’ll at last see a compelling return on all the upfront effort involved in creating such rich datasets, and also start to see which sorts of data collisions generate the most impactful results.

I’d like to believe that once the value of rich integrated datasets is clearly demonstrated, institutions will respond by recognizing the importance to science and to patients of sharing and combining data.  More likely, organizations will instead urgently accelerate their efforts to build – and monetize – their own integrated data silos.

Addendum (*): Altman quotes added on November 1, 2015.