Sunday, October 9, 2016
Big Data - What Is It Good For?
Big Data and Data Science have been buzzwords in science and industry for over a decade. A Medline search shows over a thousand current references to Big Data in healthcare. A good starting point is to consider what is meant by Big Data and then discuss the implications. A quick scan of the references shows that they vary greatly in technical complexity. A standard definition from Google is: "extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions." These techniques developed because of the widespread availability of digitized data and the ease with which sets of behavioral choices (in the form of mouse clicks) can be collected on web sites. In many cases specific data collection paradigms can be used to elicit the information, but there is also a wealth of static data out there. In health care, any electronic health record is a massive source of static data. Financial, real estate, and educational records, along with records of all of these transactions, are also significant sources.
Most Americans logging in to set up a Social Security account online (ssa.gov) in the past couple of years would be surprised at what it takes to complete the job. After the preliminary information there is a set of 5 security questions. Four of those questions are about your detailed personal credit history - home mortgage information and credit card history. When Social Security was initially set up in 1936 there was widespread concern that Social Security Numbers would become national identifiers. At one point Congress had to assure the electorate that the number would not be used for that purpose. Since then the SSN has been used for multiple identification purposes including credit reporting. At this point it seems that we have come full circle. Congress invented the SSN and told people it would not be an identifier. They then mandated its use as an identifier. Congress authorized and basically invented the credit reporting system in the United States. The federal government currently uses the credit reporting system to quiz taxpayers wanting to set up a Social Security account online. In the meantime, large amounts of financial, legal, and health care data are being collected about you under your SSN in data systems everywhere. At this point the full amount of that data and the reasons why it is being collected for any person in the US are unknown because it is all collected without your knowledge or your consent. It is impossible to "opt out" from this data collection. The federal government does have an initiative to remove SSNs from health records, but there are so many other identifiers out there right now that this effort is too little and too late.
Additional sources of data include your online footprint, including sites that you may have visited and what you seem to be interested in. A visit to Amazon, for example, and a quick look at an expensive digital camera may result in that same camera with a link to Amazon in the margin of every other web page you see for the next two weeks. Expensive digital cameras of a different brand than the one you originally looked at may start showing up. You may notice ads showing up in your Facebook feed for products that you mentioned casually to your friends during a conversation there. The conversation could be as generic as bicycle seats and suddenly you are seeing a flurry of ads for bicycle seats. Any number of web sites encourage you to sign in with other accounts and then share your account information with them. All of this data provides companies with what they need to fuel their predictive algorithms to sell you a product. It provides the major advertisers in this space like Google and Facebook with a huge revenue source because, given the scale and personalization of these ads, they are effective. Big Data seems to be very good for business. But is there a downside?
That brings me to a current resource on the nefarious uses of Big Data written by an expert in the field. Cathy O'Neil has a PhD in mathematics. Her PhD work was in algebraic number theory. She started work as an academician but subsequently worked for a hedge fund, worked as a data scientist for several firms, and currently heads the Lede Program in Data Journalism at Columbia University. I am familiar with her work through her blog MathBabe. Her newly released book Weapons of Math Destruction takes a look at the dark side of Big Data, specifically how data collection and biased algorithms can be good for administrators, politicians, and business but bad for anyone who falls under the influence of those agencies and their work. In the introduction she leads off with the example of teacher assessments. I was familiar with a scattergram that she had posted on her web site showing that year to year teacher assessment scores were essentially uncorrelated or random. In the book she describes the human toll, in this case a teacher fired because of this defective algorithm. In another example later in the book, an experienced teacher scored a 6 out of 100 on a "value-added" teacher evaluation. Only tenure kept him from getting fired. The scoring algorithm was opaque and nobody could tell him what had happened. The next year he scored 96 out of 100. But the algorithm was so flawed he knew that score was no more legitimate than the last one. With the politicized environment surrounding teaching, the proponents of teacher "accountability" like this variation since it fits their ideas about the system retaining incompetent teachers who need to be weeded out. In fact, the algorithm is defective and like many is based on erroneous assumptions.
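To make the point about uncorrelated year to year scores concrete, here is a minimal Python sketch. It is not O'Neil's data and it is not the actual value-added model, which is opaque. It simply assumes that each teacher has a fixed underlying skill and that each year's score is that skill plus random measurement noise. The function names, the noise levels, and the number of teachers are all hypothetical choices made for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_scores(n_teachers, noise_sd, seed=0):
    """Return two years of scores for the same hypothetical teachers.

    Assumption: score = fixed skill (standard deviation 1) + yearly noise.
    Nothing about the teacher changes between the two years.
    """
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n_teachers)]
    year1 = [s + rng.gauss(0, noise_sd) for s in skill]
    year2 = [s + rng.gauss(0, noise_sd) for s in skill]
    return year1, year2


# Compare a precise measurement with a very noisy one.
for noise_sd in (0.0, 1.0, 5.0):
    y1, y2 = simulate_scores(n_teachers=2000, noise_sd=noise_sd)
    print(f"noise_sd={noise_sd}: year to year correlation = {pearson(y1, y2):.2f}")
```

Under these assumptions, the more the noise swamps the real differences between teachers, the closer the year to year correlation gets to zero, even though no teacher changed anything. A score that swings from 6 to 96 is what that kind of measurement would be expected to produce.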
I personally know that physicians are subjected to the same processes as teachers, but so far it is less technologically advanced. O'Neil points out that there is nothing magical about algorithms and that they frequently incorporate the biases of the people who design and contract for them. Opacity and a lack of correction by feedback are other features. I worked for the same employer for a number of years when physician "accountability" measures were put in place. The "algorithm" for salary went something like this: RVU Productivity + Outside Billing + Citizenship = Pay. RVUs were the total number of patients seen according to the biased government and managed care billing schemes. Outside billing was any consulting work done outside of the clinical work that was billed through the department. Citizenship included teaching and administrative duties as well as any Grand Rounds or CME lectures that were done. In other words, apart from the subjectively based billing scheme, all of the inputs are almost totally subjective and influenced by all kinds of pseudoaccountability measures along the way. For example, in parallel with the teacher ranked on the algorithm, I was told one year that I had achieved the top rank in terms of documentation in a group of about 25 physicians. The next year - making no changes at all in terms of that documentation - I was dead last. My conclusion, like that of the teacher in the example, was that the rating scheme was completely bogus, and with that kind of a scheme who cares about the results?
The biased algorithms applied to physicians have eerie parallels to those mentioned in WMD. Here are a few that I picked out on the first read:
1. The algorithm is based on faulty data - the teacher evaluation algorithms were based on a faulty interpretation of data in the A Nation at Risk report. The report concluded that teachers were responsible for declining SAT scores between 1963 and 1980. When Sandia Labs reanalyzed the data 7 years later they found that a great expansion in the number of people taking the test was responsible for the decreased average score, but subgroup analysis by income group showed improved scores for each group (p. 136). The only reason that teachers are still being blamed is political convention. I posted here several years ago that the top ranked students in the world, in Finland, are taught by teachers who are assumed to be professionals and who are not critiqued on test results.
The parallel in medicine was the entire reason that medicine is currently managed by the government and the healthcare industry. It was based on criticism in the 1980s that doctors were lining their pockets by performing unnecessary procedures and that work quality was poor. That should sound familiar because that criticism has been carried forward despite a major study that showed it was completely wrong. The massive Peer Review Standards Organizations (PRSO) in each state in the 1990s conducted rigorous reviews of all Medicare hospitalizations and concluded that there was so little overutilization and so few quality problems that it would not pay to continue the program. The only reason that managed care companies exist today is by political convention.
2. An effective teacher, like an effective doctor, is too complex to model - When that happens only indirect measures or "crude proxies" (p. 208) can be used to estimate effectiveness. In medicine, as in teaching, the proxy measures are incredibly crude. They generally depend on diagnosis, poorly account for comorbid illness, and the outcome measures are heavily influenced by business rather than medical decision making. The best examples are the length of stay and readmission parameters. Every physician knows that there are set payment schedules based on the supposed ideal length of stay for a particular illness. The business influence in the discharge decision is so malignant these days that non-physician case managers are present to pressure physicians into discharging patients. If the discharge beats the length of stay parameter - the hospital makes money. I sat in a meeting at one point and asked the obvious question: "OK - we have completed the discharge checklist - do we know the outcomes? How do the patients do when they are discharged by this process? How many of them die?" Dead silence followed. Most people would be shocked to hear that what passes for evidence based medicine is often a checklist that has no meaning in the real world. Making the points on the checklist is good for advertising though.
3. There is a lack of transparency in the overall process - The teachers in WMD who were blindsided by the algorithm were never told how that conclusion was reached. I encountered the same problem in a managed care organization when it was clear to me that administrators with no knowledge of psychiatry were telling us what to do. In some cases, "consultants" were brought in to write reports to confirm the most recent administrative edicts. When I asked my boss if I could talk with the people sending out the edicts I was informed that there was a "firewall" between clinicians and upper management. This lack of feedback is another critical dimension of algorithms gone astray. If you are writing an algorithm biased toward a business goal - why would you want feedback from clinicians? Why would you want any humanity or clinical judgment added especially in the case of psychiatric care? Let's just have a dangerousness algorithm and leave it at that. Those are the only people who get acute treatment, even though it is patently unfair relative to how the rest of medicine works.
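The Sandia reanalysis described in point 1 is an instance of what statisticians call Simpson's paradox: every subgroup can improve while the overall average falls, because the mix of people being measured has changed. The short Python sketch below works through the arithmetic. The group sizes and mean scores are invented for illustration only and are not the SAT figures from the report or from the book.

```python
def overall_mean(groups):
    """Weighted mean score across groups.

    `groups` maps a group name to (number_of_test_takers, mean_score).
    """
    total_takers = sum(n for n, _ in groups.values())
    total_points = sum(n * mean for n, mean in groups.values())
    return total_points / total_takers


# Hypothetical numbers: (test takers, mean score) by income group.
earlier = {"high": (100, 550), "middle": (50, 500), "low": (10, 400)}
later = {"high": (100, 560), "middle": (150, 510), "low": (150, 420)}

# Every income group scores higher in the later year...
for group in earlier:
    print(group, earlier[group][1], "->", later[group][1])

# ...but the overall average drops, because far more test takers now come
# from the groups with lower average scores.
print("overall:", overall_mean(earlier), "->", overall_mean(later))
```

In this made-up example the overall average falls from 525 to 488.75 even though each group gained 10 to 20 points. Blaming the people doing the teaching for the drop in the headline number would be exactly backwards.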
Big Data is good for science. We can't do elementary particle physics or genomic analysis very well without it. Big Data is also good for business in much different ways. There are clearly people out there who cannot resist buying an item online if the Amazon algorithm shows it to them enough times across a number of web pages. Big Data in business can also come up with billing algorithms that have less to do with reality than with making a profit. Similar programs can be found for employee scheduling, performance analysis, and downsizing. The problems happen when the business biases of Big Data are introduced to science and medicine. Those techniques are responsible for an array of pseudoquality and pseudoaccountability measures for physicians, hospitals, and clinics.
Unfortunately physicians seem to have given in to the political conventions that have been put upon us. Some administrator somewhere suggests that quality care now depends on a patient portal into an electronic health record and a certain number of emails sent by patients to their physicians every month. Across the country that will result in hundreds of millions of emails to physicians who are already burned out from creating highly stylized documentation that is used only for billing purposes. Terabytes of useless information that nobody will ever read again - the product of a totally subjective billing and coding process that started over two decades ago. Is there any data showing that email communication is tied to the effectiveness or technical expertise of the physician? I doubt it. I worked with great physicians long before email existed.
It is about time that somebody pointed out these manipulations provide plenty of leverage for the management class in this country at the expense of everyone else. It is well past the time that doctors should be confronting this charade.
George Dawson, MD, DFAPA
1. Cathy O'Neil. Weapons of Math Destruction - How Big Data Increases Inequality And Threatens Democracy. Crown Publishing Group. New York, NY, 2016. I highly recommend this book for a look at the other side of Big Data. It is written in non-technical language and is very readable.
The photo at the top is a Server Room in CERN by Florian Hirzinger - www.fh-ap.com (Own work (Florian Hirzinger)) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons (https://commons.wikimedia.org/wiki/File%3ACERN_Server_03.jpg).