Sunday, December 9, 2018

What Isn't Available In Multimillion Dollar EHRs? Decision Support from 1994

Physician Decision Support Software from the 20th Century

I used to teach a class in medical informatics. My emphasis was on not mistaking a physical illness for a psychiatric one, and on not missing medical comorbidity in identified psychiatric patients.  The class was all about decision-making, heuristics, and recognition of the biases that cause errors in medical decisions. Bayesian analysis and inductive reasoning were a big part of the course. About that time, software packages were also available to assist in diagnostic decisions. Some of them had detailed weighting estimates to show the relative importance of diagnostic features.  It was possible to enter a set of diagnostic features and get a listing of probable diagnoses for further exploration. I printed out some examples for class discussions, and we also reviewed research papers and looked at the issue of pattern recognition by different medical specialists.

The available software packages of the day were reviewed in the New England Journal of Medicine (1).  In that review, 10 experts each contributed 15 cases as written summaries; those cases were then cross-checked for validity and pared down to 105 cases.  The four software programs (QMR, Iliad, DXplain, and Meditel) were compared on their ability to suggest the correct diagnosis. Two of the programs used Bayesian algorithms and two used non-Bayesian algorithms. The authors point out that probability estimates varied based on the literature and clinical data used to establish them. In the test, the developers of each program entered the diagnostic terms, and the compared outcomes were the lists of diagnoses produced by each program, rank ordered according to likelihood.
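The Bayesian approach used by two of these programs can be sketched in a few lines: combine a prior (disease prevalence) with per-finding likelihoods and rank diagnoses by posterior probability. This is a minimal illustration only; the diseases, priors, and likelihoods below are invented and do not come from any of these programs' knowledge bases.

```python
def rank_diagnoses(findings, priors, likelihoods):
    """Return diagnoses ranked by normalized posterior probability."""
    scores = {}
    for dx, prior in priors.items():
        p = prior
        for f in findings:
            # P(finding | dx); use a small default when the knowledge base
            # has no entry, mirroring the incomplete knowledge bases noted
            # in the 1994 review
            p *= likelihoods.get(dx, {}).get(f, 0.01)
        scores[dx] = p
    total = sum(scores.values())
    return sorted(((dx, p / total) for dx, p in scores.items()),
                  key=lambda item: item[1], reverse=True)

# Invented example values for illustration
priors = {"pneumonia": 0.05, "pulmonary embolism": 0.01}
likelihoods = {
    "pneumonia": {"fever": 0.8, "productive cough": 0.7},
    "pulmonary embolism": {"fever": 0.2, "productive cough": 0.05},
}
ranked = rank_diagnoses(["fever", "productive cough"], priors, likelihoods)
```

The variation the authors describe follows directly from this structure: any change in the prior or likelihood estimates drawn from the literature changes the ranking.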

The metrics used to compare the programs were correct diagnosis, comprehensiveness (in terms of the differential diagnosis list generated), rank, and relevance.  The proportion of the cited diagnoses contained in each program's knowledge base ranged from 0.73 to 0.91. The programs made the correct diagnosis in 0.52 - 0.71 of all 105 cases and in 0.71 - 0.89 of a 63-case subset; the 63-case list was used because those diagnoses were listed in all four knowledge bases.  The authors concluded that the lists generated had low sensitivity and specificity, but that unique diagnoses were suggested that the experts agreed might be important. They concluded that studying the performance of these programs in clinical settings, used by physicians, was a necessary next step. They speculated that physicians might use these programs not only to generate diagnoses but also to look at specific findings and how they might affect the differential diagnosis.

A study (2) came out five years later that was a direct head-to-head comparison of two physicians using QMR software to assess 154 internal medicine admissions where there was no known diagnosis.  In this study physician A obtained the correct diagnosis in 62 (40%) of the cases and physician B in 56 (36%). That difference was not statistically significant. Only 137 cases had the diagnosis listed in the QMR knowledge base. Correcting for that difference, correct diagnoses increased to 45% for physician A and 41% for physician B. The authors concluded that a correct diagnosis listed in the top five diagnoses 36 - 40% of the time was not accurate enough for a clinical setting, but they suggested that expanding the knowledge base would probably improve that rate.
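The correction the authors applied is simple arithmetic: restrict the denominator to the cases whose diagnosis was actually in the QMR knowledge base (137 of 154). A quick check of the figures quoted above:

```python
# Reproducing the knowledge-base correction from reference (2)
total_cases, in_kb = 154, 137   # admissions assessed; cases whose dx was in QMR
correct_a, correct_b = 62, 56   # correct diagnoses for physicians A and B

raw_a = correct_a / total_cases       # ~0.40
raw_b = correct_b / total_cases       # ~0.36
adjusted_a = correct_a / in_kb        # ~0.45
adjusted_b = correct_b / in_kb        # ~0.41
```

The adjusted figures match the 45% and 41% reported, which is why the authors argued that expanding the knowledge base would probably raise the effective accuracy.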

Since then the preferred description of this software has become differential diagnosis generators (DDX) (3,4). A paper from 2012 looked at a total of 23 of these programs but eventually included only 4 in the analysis. The programs were tested on ten consecutive diagnosis-focused cases chosen from 2010 editions of the Case Records of the New England Journal of Medicine (NEJM) and the Medical Knowledge Self-Assessment Program (MKSAP), version 14, of the American College of Physicians. A 0-5 scoring system was developed, ranging from 5 = diagnosis suggested on the first screen or in the first 20 suggestions down to 0 = no suggestions close to the target diagnosis, giving a total scoring range of 0-50 across the ten cases. Two of the programs exactly matched the diagnosis 9 and 10 times respectively. These same two programs, DXplain and Isabel, had identical mean scores of 3.45 and were described as performing well. There was a question of integration with EHRs, but the authors thought that these programs would be useful for education and decision support. They mention a program in development that automatically incorporates available EHR data and generates a list of diagnoses even without clinician input.

The most recent paper (4) is a systematic review and meta-analysis of differential diagnosis (DDX) generators. In the introductory section the authors quote a 15% figure for the rate of diagnostic errors in most areas of medicine. A larger problem is that 30-50% of patients seeking primary care or specialty consultation do not get an explanation for their presenting symptoms. The review looked at the ability of the programs to generate correct lists of diagnoses, whether the programs were as good as clinicians, whether the programs could improve the clinicians' lists of differential diagnoses, and the practical aspects of using DDX generators in clinical practice. The inclusion criteria resulted in 36 articles comparing 11 DDX programs (see Table 2).  The original paper contains a forest plot of the results of the DDX generators showing variable (but in some cases high) accuracy but also a high degree of heterogeneity across studies.  The authors conclude that there is insufficient evidence to recommend DDX generators, based on the variable quality and results noted in this study.  But I wonder if that is really true.  Some of the DDX generators did much better than others, and one of them (Isabel) refers to this study in its own advertising literature.

My main point in this post is to illustrate that these DDX generators have been around for nearly 30 years and the majority of very expensive electronic health record (EHR) installations have none.  The ones that do are often in systems where they are actively being studied by physicians in that group, or one has been added and its integration with the system is questionable.  In other words, do all of the clinical features import into the DDX generator automatically, so that the responsible clinician can look at the list without having to decide to invoke it?  At least one paper in this literature suggests that automatic generation eliminates the bias of deciding whether or not to use diagnostic assistance.  In terms of physician workflow, that would seem to be an ideal situation, unless the software stopped the workflow like practically all drug interaction checking software.

The drug interaction software may be a good place to start. Some of these programs are much more intelligent than others. In the one I am currently using, trazodone x any serotonergic medication is a hard stop, and I have to produce a flurry of mouse clicks to move on.  More intelligent programs do not stop the workflow for this interaction or for SSRI x bupropion combinations.  There is also the question of where artificial intelligence (AI) fits in.  There is a steady stream of headlines about how AI can make medical diagnoses better than physicians, and yet there is no AI implementation in EHRs designed to assist physicians.  What would AI have to say about the above drug interactions? Would it still stop my work process and cause me to check a number of exception boxes? Would it be able to produce an aggregate score of all such prescriptions in an EHR and provide a probability statement for a specific clinical population?  The quality of clinical decisions could only improve with that information.
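The difference between the dumb and intelligent checkers above comes down to tiered severity: a smarter system distinguishes a true contraindication that should stop the workflow from a common combination that only merits an informational note. A hypothetical sketch, with severity assignments invented for illustration rather than taken from any clinical rule set:

```python
# Invented severity table for illustration only -- not clinical guidance.
# A blunt checker treats every serotonergic pair as a hard stop; a tiered
# one reserves hard stops for genuine contraindications.
SEVERITY = {
    frozenset({"trazodone", "fluoxetine"}): "warn",            # inform, don't stop
    frozenset({"tranylcypromine", "fluoxetine"}): "hard_stop", # MAOI x SSRI
}

def check_prescription(new_drug, current_meds):
    """Return the highest-severity alert triggered by adding new_drug."""
    order = {"none": 0, "warn": 1, "hard_stop": 2}
    worst = "none"
    for med in current_meds:
        level = SEVERITY.get(frozenset({new_drug, med}), "none")
        if order[level] > order[worst]:
            worst = level
    return worst
```

Only a "hard_stop" would interrupt the workflow; a "warn" could be displayed without the flurry of mouse clicks. An aggregate or population-level score of the kind speculated about above would be a layer on top of this kind of lookup.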

Then there is the question of what psychiatrists would use a DDX generator for.  The current crop has a definite internal medicine bias.  Psychiatrists and neurologists need an entirely different diagnostic landscape mapped out.  The intersection of psychiatric syndromes, toxidromes, and primary neurological disorders needs to be added and expanded upon. I am currently experimenting with the Isabel package and need to figure out the best way to use it.  My experimental paradigm is a patient recently started on lithium who develops an elevated creatinine, but who was also started on cephalexin a few days after the lithium.  Entering all of those features seems to produce a random list of diagnoses, and it begs the question of whether an increasing creatinine is due to lithium or cephalexin.  It appears that the way the diagnostic features are entered may affect the outcome.

Decision support is supposed to be a feature of the modern electronic health record (EHR). The reality is that the only decision support is a drug interaction feature that varies greatly in quality from system to system. Both the drug interaction software and DDX generators are very inexpensive options for clinicians.  EHRs don't seem to get 1990s software right.  And it does lead to the question: "Why are EHRs so expensive and why do they lack appropriate technical support for physicians?"

Probably because they were really not built for physicians.

George Dawson, MD, DFAPA


1: Berner ES, Webster GD, Shugerman AA, Jackson JR, Algina J, Baker AL, Ball EV, Cobbs CG, Dennis VW, Frenkel EP, et al. Performance of four computer-based diagnostic systems. N Engl J Med. 1994 Jun 23;330(25):1792-6. PubMed PMID:8190157.

2: Lemaire JB, Schaefer JP, Martin LA, Faris P, Ainslie MD, Hull RD. Effectiveness of the Quick Medical Reference as a diagnostic tool. CMAJ. 1999 Sep 21;161(6):725-8. PubMed PMID: 10513280.

3: Bond WF, Schwartz LM, Weaver KR, Levick D, Giuliano M, Graber ML. Differential diagnosis generators: an evaluation of currently available computer programs. J Gen Intern Med. 2012 Feb;27(2):213-9. doi: 10.1007/s11606-011-1804-8. Review. PubMed PMID: 21789717.

4: Riches N, Panagioti M, Alam R, Cheraghi-Sohi S, Campbell S, Esmail A, Bower P. The Effectiveness of Electronic Differential Diagnoses (DDX) Generators: A Systematic Review and Meta-Analysis. PLoS One. 2016 Mar 8;11(3):e0148991. doi: 10.1371/journal.pone.0148991. eCollection 2016. Review. PubMed PMID: 26954234.
