Sunday, October 9, 2016
Big Data - What Is It Good For?
Big Data and Data Science have been buzzwords in science and industry for over a decade. A Medline search shows over a thousand current references to Big Data in healthcare. A good starting point is to consider what is meant by Big Data and then discuss the implications. A quick scan of the references shows that they vary greatly in technical complexity. A standard definition from Google is: "extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions." These techniques developed because of the widespread availability of digitized data and the ease with which sets of behavioral choices (in the form of mouse clicks) can be collected on web sites. In many cases specific data collection paradigms can be used to elicit the information, but there is also a wealth of static data out there. In health care, any electronic health record is a massive source of static data. Financial, real estate, and educational records - and the records of all of the associated transactions - are also a significant source.
Most Americans logging in to set up a Social Security account online (ssa.gov) in the past couple of years would be surprised at what it takes to complete the job. After the preliminary information there is a set of 5 security questions. Four of those questions are about your detailed personal credit history - home mortgage information and credit card history. When Social Security was initially set up in 1936 there was widespread concern that Social Security Numbers would become national identifiers. At one point Congress had to assure the electorate that the number would not be used for that purpose. Since then the SSN has been used for multiple identification purposes including credit reporting. At this point it seems that we have come full circle. Congress invented the SSN and told people it would not be an identifier, then mandated its use as an identifier. Congress also authorized and basically invented the credit reporting system in the United States. The federal government currently uses that credit reporting system to quiz taxpayers wanting to set up a Social Security account online. In the meantime, large amounts of financial, legal, and health care data are being collected about you under your SSN in data systems everywhere. The full extent of that data and the reasons why it is being collected for any person in the US are unknown, because it is all collected without your knowledge or consent. It is impossible to "opt out" of this data collection. The federal government does have an initiative to remove SSNs from health records, but there are so many other identifiers out there right now that this effort is too little and too late.
Additional sources of data include your online footprint - the sites that you have visited and what you seem to be interested in. A visit to Amazon, for example, and a quick look at an expensive digital camera may result in that same camera, with a link to Amazon, appearing in the margin of every other web page you see for the next two weeks. Expensive digital cameras of a different brand than the one you originally looked at may start showing up. You may notice product ads showing up in your Facebook feed for items you mentioned casually to your friends during a conversation there. The conversation could be as generic as bicycle seats and suddenly you are seeing a flurry of ads for bicycle seats. Any number of web sites encourage you to sign in with other accounts and then share your account information with them. All of this data provides companies with what they need to fuel their predictive algorithms to sell you a product. It provides the major advertisers in this space, like Google and Facebook, with a huge revenue source because, at that scale and level of personalization, the ads are effective. Big Data seems to be very good for business. But is there a downside?
That brings me to a current resource on the nefarious uses of Big Data written by an expert in the field. Cathy O'Neil has a PhD in mathematics; her doctoral work was in algebraic number theory. She started out as an academic but subsequently worked for a hedge fund, worked as a data scientist for several firms, and currently heads the Lede Program in Data Journalism at Columbia University. I am familiar with her work through her blog MathBabe. Her newly released book Weapons of Math Destruction takes a look at the dark side of Big Data - specifically how data collection and biased algorithms can be good for administrators, politicians, and businesses but bad for anyone who falls under the influence of those agencies and their work. In the introduction she leads off with the example of teacher assessments. I was familiar with a scattergram that she had posted on her web site showing that year-to-year teacher assessment scores were essentially uncorrelated or random. In the book she describes the human toll - in this case a teacher fired because of this defective algorithm. In another example later in the book, an experienced teacher scored a 6 out of 100 on a "value-added" teacher evaluation. Only tenure kept him from getting fired. The scoring algorithm was opaque and nobody could tell him what had happened. The next year he scored 96 out of 100, but the algorithm was so flawed that he knew this score was no more legitimate than the last one. In the politicized environment surrounding teaching, the proponents of teacher "accountability" like this variation because it fits their belief that the system retains incompetent teachers who need to be weeded out. In fact, the algorithm is defective and, like many, is based on erroneous assumptions.
I personally know that physicians are subjected to the same processes as teachers, though so far they are less technologically advanced. O'Neil points out that there is nothing magical about algorithms - they frequently incorporate the biases of the people who design and contract for them. Opacity and a lack of correction by feedback are other common features. I worked for the same employer for a number of years when physician "accountability" measures were put in place. The "algorithm" for salary went something like this: RVU Productivity + Outside Billing + Citizenship = Pay. RVUs were the total number of patients seen according to the biased government and managed care billing schemes. Outside billing was any consulting work done outside of the clinical work that was billed through the department. Citizenship included teaching and administrative duties as well as any Grand Rounds or CME lectures. In other words, apart from the subjectively based billing scheme, all of the inputs were almost totally subjective and influenced by all kinds of pseudoaccountability measures along the way. For example, in parallel with the teacher ranked by the algorithm, I was told one year that I had achieved the top rank for documentation in a group of about 25 physicians. The next year - making no changes at all in that documentation - I was dead last. My conclusion, like that of the teacher in the example, was that the rating scheme was completely bogus, and with that kind of a scheme who cares about the results?
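To make the opacity problem concrete, here is a minimal sketch of the kind of compensation formula described above. Everything in it - the conversion factor, the "citizenship" weight, the example numbers - is a hypothetical assumption for illustration, not the actual formula any employer used. The point is that only one input is even nominally objective, and none of the weights are visible to the physician being scored.

```python
# Hypothetical sketch of a "Pay = RVU Productivity + Outside Billing + Citizenship"
# formula. All weights and numbers are assumptions for illustration only.

def physician_pay(rvu_total: float, conversion_factor: float,
                  outside_billing: float, citizenship_score: float,
                  citizenship_weight: float) -> float:
    """Combine the three inputs described in the post into a single pay figure."""
    productivity_pay = rvu_total * conversion_factor          # driven by the billing scheme
    citizenship_pay = citizenship_score * citizenship_weight  # subjective rating, opaque weight
    return productivity_pay + outside_billing + citizenship_pay

# The same clinical work scored with two different subjective "citizenship" ratings:
print(physician_pay(5000, 40.0, 10000, 8, 2000))   # rated favorably   -> 226000.0
print(physician_pay(5000, 40.0, 10000, 2, 2000))   # rated unfavorably -> 214000.0
```

The arithmetic is trivial; the problem is that the inputs and weights are neither disclosed nor validated.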
The number of biased algorithms applied to physicians has eerie parallels to those mentioned in WMD. Here are a few that I picked out on the first read:
1. The algorithm is based on faulty data - the teacher evaluation algorithms were based on a faulty interpretation of data in the Nation at Risk report. The report concluded that teachers were responsible for declining SAT scores between 1963 and 1980. When Sandia Labs reanalyzed the data 7 years later, they found that a great expansion in the number of people taking the test was responsible for the decreased average score, and that a subgroup analysis by income group showed improved scores for each group (p. 136). The only reason that teachers are still being blamed is political convention. I posted here several years ago that the top-ranked students in the world, in Finland, are taught by teachers who are assumed to be professionals and who are not critiqued on test results.
The parallel in medicine is the entire rationale for why medicine is currently managed by the government and the healthcare industry. It was based on criticism in the 1980s that doctors were lining their pockets by performing unnecessary procedures and that the quality of their work was poor. That should sound familiar, because that criticism has been carried forward despite a major study showing it was completely wrong. The massive Peer Review Standards Organizations (PRSO) in each state conducted rigorous reviews of all Medicare hospitalizations in the 1990s and concluded that there was so little overutilization and so few quality problems that it would not pay to continue the program. The only reason that managed care companies exist today is political convention.
2. An effective teacher, like an effective doctor, is too complex to model - When that happens, only indirect measures or "crude proxies" (p. 208) can be used to estimate effectiveness. In medicine, as in teaching, the proxy measures are incredibly crude. They generally depend on diagnosis, poorly account for comorbid illness, and the outcome measures are heavily influenced by business rather than medical decision making. The best examples are the length-of-stay and readmission parameters. Every physician knows that there are set payment schedules based on the supposed ideal length of stay for a particular illness. The business influence on the discharge decision is so malignant these days that non-physician case managers are present to pressure physicians into discharging patients. If the discharge beats the length-of-stay parameter, the hospital makes money (see the sketch after this list). I sat in a meeting at one point and asked the obvious question: "OK - we have completed the discharge checklist - do we know the outcomes? How do the patients do when they are discharged by this process? How many of them die?" Dead silence followed. Most people would be shocked to hear that what passes for evidence-based medicine is often a checklist that has no meaning in the real world. Making the points on the checklist is good for advertising though.
3. There is a lack of transparency in the overall process - The teachers in WMD who were blindsided by the algorithm were never told how the conclusion was reached. I encountered the same problem in a managed care organization when it was clear to me that administrators with no knowledge of psychiatry were telling us what to do. In some cases, "consultants" were brought in to write reports confirming the most recent administrative edicts. When I asked my boss if I could talk with the people sending out the edicts, I was informed that there was a "firewall" between clinicians and upper management. This lack of feedback is another critical dimension of algorithms gone astray. If you are writing an algorithm biased toward a business goal, why would you want feedback from clinicians? Why would you want any humanity or clinical judgment added, especially in the case of psychiatric care? Let's just have a dangerousness algorithm and leave it at that. Then the only people who get acute treatment are those flagged as dangerous, even though that is patently unfair relative to how the rest of medicine works.
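To illustrate the length-of-stay incentive mentioned in point 2, here is a minimal sketch under assumed numbers. The flat payment, daily cost, and target are hypothetical; the structure - a fixed payment per admission minus per-day costs - is what makes every day under the target look like pure margin, with no outcome term anywhere in the calculation.

```python
# Hypothetical illustration of the length-of-stay incentive: a flat payment per
# admission minus a per-day cost. All numbers are assumptions, not real rates.

FIXED_PAYMENT = 12_000.0   # assumed flat payment for the admission
DAILY_COST = 2_000.0       # assumed average cost per inpatient day
LOS_TARGET = 5             # assumed length-of-stay parameter, in days

def margin(actual_los_days: int) -> float:
    """Hospital margin for one admission at a given length of stay."""
    return FIXED_PAYMENT - DAILY_COST * actual_los_days

for days in (3, LOS_TARGET, 7):
    print(f"{days} days: margin {margin(days):+,.0f}")
# Nothing in this calculation measures readmission, recovery, or mortality.
```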
Big Data is good for science. We can't do elementary particle physics or genomic analysis very well without it. Big Data is also good for business, in much different ways. There are clearly people out there who cannot resist buying an item online if the Amazon algorithm shows it to them enough times across a number of web pages. Big Data in business can also come up with billing algorithms that have less to do with reality than with making a profit. Similar programs can be found for employee scheduling, performance analysis, and downsizing. The problems happen when the business biases of Big Data are introduced into science and medicine. Those techniques are responsible for an array of pseudoquality and pseudoaccountability measures for physicians, hospitals, and clinics.
Unfortunately, physicians seem to have given in to the political conventions that have been imposed upon us. Some administrator somewhere suggests that quality care now depends on a patient portal into an electronic health record and a certain number of emails sent by patients to their physician every month. Across the country that will result in hundreds of millions of emails to physicians who are already burned out creating highly stylized documentation that is used only for billing purposes - terabytes of useless information that nobody will ever read again, the product of a totally subjective billing and coding process that started over two decades ago. Is there any data showing that email communication is tied to the effectiveness or technical expertise of the physician? I doubt it. I worked with great physicians long before email existed.
It is about time that somebody pointed out that these manipulations provide plenty of leverage for the management class in this country at the expense of everyone else. It is well past time that doctors confronted this charade.
George Dawson, MD, DFAPA
References:
1. Cathy O'Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, New York, NY, 2016. I highly recommend this book for a look at the other side of Big Data. It is written in non-technical language and is very readable.
Attribution:
The photo at the top is of a server room at CERN, by Florian Hirzinger (www.fh-ap.com), own work, licensed under CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html), via Wikimedia Commons: https://commons.wikimedia.org/wiki/File%3ACERN_Server_03.jpg
Sunday, February 3, 2013
Big Data and Psychiatry - Moving Past the Mental Status Exam
I was a fan of big data before it became fashionable. I was a high tech investor before the dot.com bubble and became very interested in high speed networking, especially the hardware necessary to move all of that data around. Even before that information was publicly available, electrical engineers were using that equipment to rapidly download large amounts of data (gigabytes) from satellites on every orbit. One of the early flagship applications I followed as an investor was large telescopes. I wrote an article on high speed networks and their medical applications - digital radiology and medical records - back in 1997. At about the same time I made the information connection.
As a college student, I got my hands on the Whole Earth Catalog. That led me to my small college library, where to my surprise I found Shannon's seminal work on information theory on the shelf. I was even more excited when I learned about entropy in my physical chemistry course three years later. Since then I have been searching, without much success, for ways to look at what happens when two people are sitting in a room and talking with one another.
My entire career has been spent talking with people for about an hour and generating a document about what happened. It turns out that the document is slanted toward tradition and toward government and insurance company requirements. It covers a number of points that are historical and others that are observational. The data is basically generated to match a pattern in my head that allows for the generation of a diagnosis and a treatment plan. The urgency of the situation can make the treatment plan the priority. The people I am conversing with have various levels of enthusiasm for the interaction. In some cases, they clearly believe that providing me with any useful data is not in their best interest. Others provide an excessive amount of detail, and I often find myself scrambling to get to the critical elements before the hour expires (my current initial interview form has about 229 categories). This basic clinical interview has been the way that psychiatrists collect information for well over a century. In the rest of medicine, the history and physical examination have become less important due to advances in technology. As an example, it is rare to see a cardiologist these days who depends very much on a detailed physical examination when they know they are going to order an echocardiogram and get data from a more accurate source.
In psychiatry, other than information from a collateral interview and old records, there is no more accurate source of information than the patient. This creates problems when the patient has problems with recall, motivation, or other brain functions that get in the way of describing their history, their subjective state, or the impact on their life. The central questions - how much useful information is communicated in a session, what the signal-to-noise considerations are, and what might be missing - have never been answered. The minimal threshold for data collection has never been determined. In fact, every information specialist I have ever contacted has no idea how these variables might be determined.
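One crude way to start putting numbers on a conversation is the approach in Shannon's paper cited below: estimate the entropy of the recorded symbols. The sketch below is only an illustration under strong assumptions - it treats a made-up transcript fragment as a bag of characters and ignores prosody, context, and meaning entirely - but it shows that "how much information was exchanged" can at least be framed quantitatively.

```python
# Minimal illustration: Shannon entropy of a transcript's character distribution,
# in bits per character. This ignores prosody, word order, and meaning - it only
# shows that the question can be framed quantitatively at all.

import math
from collections import Counter

def entropy_bits_per_char(text: str) -> float:
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

transcript = "patient reports poor sleep, low energy, and loss of interest"  # made-up example
h = entropy_bits_per_char(transcript)
print(f"{h:.2f} bits per character, roughly {h * len(transcript):.0f} bits in total")
```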
Information estimates have become more available over the past decade, ranging from estimates of the total words spoken by humans in history to the total amount of data produced in a given year. Estimates of the total words ever spoken range from 5 exabytes to 42 zettabytes, depending on whether the information is stored as typewritten words on paper or as 16-bit audio. That 8,400-fold difference illustrates one of the technical problems: what format is relevant, and what data needs to be recorded in that format? The spoken word, whether recorded or typed, is one channel - but what about prosody and paralinguistic communication? How can all of that be recorded and decoded? Is there enough machine intelligence out there to recognize the relevant patterns?
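The 8,400-fold figure follows directly from the two estimates just cited; a quick check:

```python
# Checking the fold difference between the two storage estimates quoted above:
# ~5 exabytes as typed text versus ~42 zettabytes as 16-bit audio.

EXABYTE = 10**18
ZETTABYTE = 10**21

as_text = 5 * EXABYTE
as_audio = 42 * ZETTABYTE

print(as_audio / as_text)   # 8400.0 - the 8,400-fold difference
```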
An article in this week's Nature illustrates the relative scope of the problem. Chris Mattmann makes a compelling argument for both interdisciplinary cooperation and training a new generation of scientists who know enough computer science to analyze large data sets. He gives the following examples of the sizes of these data sets (one TB = 1,000 GB):
Project | Size
Encyclopedia of DNA Elements (ENCODE), 2012 | 15 TB
US National Climate Assessment (NASA projects), 2013 | 1,000 TB
Fifth assessment report by the Intergovernmental Panel on Climate Change (IPCC), due 2014 | 2,500 TB
Square Kilometer Array (SKA), first light due 2020 | 22,000,000,000 TB per year
That means that every year the SKA will produce roughly half of the total amount of information ever spoken by humans (recorded as 16-bit audio). The author points out that the SKA will produce 700 TB of data per second and within a few days will eclipse the current size of the Internet!
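The yearly figure in the table follows from the 700 TB per second rate, and the comparison with the speech estimate is a one-line ratio:

```python
# Checking the SKA figures quoted above: 700 TB per second accumulated over a
# year, compared with the ~42 zettabyte (16-bit audio) estimate for all words
# ever spoken.

TB = 10**12
SECONDS_PER_YEAR = 365 * 24 * 3600

ska_tb_per_year = 700 * SECONDS_PER_YEAR            # ~2.2e10 TB per year
print(f"{ska_tb_per_year:,} TB per year")           # 22,075,200,000 TB

all_speech_tb = 42 * 10**21 / TB                    # 42 ZB expressed in TB
print(f"{ska_tb_per_year / all_speech_tb:.2f}")     # ~0.53 of all human speech, each year
```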
All of this makes the characterization of human communication even more urgent. We know that the human brain is an incredibly robust and efficient processor that allows us to communicate in unique and efficient ways. Even though psychiatrists focus on a small area of human behavior during a clinical interview, the time is long past due to figure out what kind of communication is occurring there and how to improve it. It is a potential source of big data - big data that could be correlated with the big data routinely generated by the human brain.
George Dawson, MD, DFAPA
Dawson G. High speed networks in medicine. Minnesota Physician 1997.
Lyman P, Varian H, Swearingen K, Charles P, Good N, Jordan L, Pal J. How Much Information? Berkeley: School of Information Management & Systems, 2003.
Mattmann CA. Computing: A vision for data science. Nature. 2013 Jan 24;493(7433):473-5. doi: 10.1038/493473a.
Shannon CE. A mathematical theory of communication. The Bell System Technical Journal 1948; 27(3): 379-423.