Talk about healthcare “big data” seems to be everywhere. It is a discussion generously leavened by the promise of the future and I am hopeful—even if only a quarter of it comes true—that it will herald a revolution in the delivery of effective and efficient clinical care. In the meantime, here is a hard-won personal primer for those of us who still have to work on planet Earth.
Lesson 1: Healthcare data are getting cheaper to acquire but more expensive to use.
With the digitization of so much of the healthcare economy in the last 10 years, data are now plentiful and relatively cheap to acquire. However, data volumes grow very fast and require ever more server and computational capacity to store and manage. Cloud-based hosting helps, but spinning up servers to run reports can get expensive quickly, especially if big data interrogation isn’t your core business.
Lesson 2: Big data aren’t always better.
Many of the clinical findings from healthcare big data confirm what earlier, smaller studies and trials have already shown. That’s why researchers do power calculations: once you reach the sample size needed to detect the effect you care about, 15 million records do not necessarily provide more insight than 1,500. If you make the mistake of asking the same question of ever larger data sets, in all likelihood you will spend a lot of time and money getting the same answer. The real trick to exploiting big data successfully is assessing whether size adds to the underlying heterogeneity, and therefore the potential explanatory power, of the data set.
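For readers who have never run a power calculation, here is a minimal sketch of the idea in Python, using only the standard library. It relies on the textbook normal-approximation formula for comparing two proportions. The function names, the 10% and 15% event rates, and the defaults of a 5% significance level and 80% power are my own illustrative assumptions, not figures from any particular study.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients needed per group to tell rate p1 from rate p2.

    Normal-approximation formula for a two-sided test of two proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


def margin_of_error(p, n, confidence=0.95):
    """Half-width of the confidence interval for an observed rate p with n records."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


# Hypothetical example: detecting a difference between a 10% and a 15% event rate.
print(sample_size_two_proportions(0.10, 0.15))

# Precision of a 10% rate estimated from 1,500 versus 15 million records.
print(margin_of_error(0.10, 1_500))
print(margin_of_error(0.10, 15_000_000))
```

Under these assumptions the required sample is in the hundreds per group, not the millions. Going from 1,500 to 15 million records does shrink the margin of error by a factor of 100 (the square root of 10,000), but if the smaller sample already answers the question with adequate precision, that extra precision rarely changes the conclusion.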
Lesson 3: Source data quality can have big analytic consequences.
It’s worth remembering that, almost by definition, big data are a byproduct of some other transaction and will be used for a purpose for which they were not designed. At best, big data don’t represent the truth, but one internally consistent version of the truth. The problem is that end users are often oblivious to how the source data were captured, cleaned, structured and normalized. Mistakes, errors, or even judgment calls by database administrators and developers anywhere along that chain can have a big impact on analytic outputs.
Even more concerning in healthcare big data analytics is the underlying volatility of the source systems themselves. Since data requirements are almost always driven by their primary application—for example documentation and billing in EHRs—few system operators give any thought to how an “upgrade” may affect secondary applications. Given that the Meaningful Use era in healthcare informatics is essentially defined by underlying source system change, big data users need to be mindful that even the internal consistency of data sets may be in routine jeopardy.
Lesson 4: Most big data analyses are surprisingly simplistic.
Many vendors will profess their big data acumen by referencing a long string of impressive-sounding statistical models. They’ll probably also speak about “world class” proprietary systems for managing and rendering data.
Don’t be fooled. Most big data analyses used in the real world are no more complex than what can be done in Excel pivot tables. The issue is that population X may be 200 million patients, while parameters Y and Z are stored in a database whose schema looks no less complex than the human genome itself. It takes a fair amount of skill to get those answers quickly and consistently. Better still, good vendors will help define an analysis plan to answer questions that don’t seem possible from the data set at first glance.
Lesson 5: The delta between information, insight and intervention remains very large.
While there are some notable exceptions, healthcare big data analytics now allow us to do with lots of data what we used to do with a little data. The size of a data set usually doesn’t make it any easier to determine causality or to figure out what’s going wrong and how to solve it. I was once told that big data just allows good managers to do what they would have done anyway, but with more confidence.
The hard truth is that big data are a tool; it’s the human interpreter who still has to do the really hard work.
Lesson 6: The greater the scope of your data, the more discipline needed to make sense of it.
One of the great temptations of healthcare big data is that they seem to present almost limitless possibilities for study and analysis. And in many ways, they do. What often happens,