OpenEvidence is a technological Trojan horse at the gates of clinical practice
Go to openevidence.com and you will see, right under the elegant logo and a free text box prompting you to ask a medical question, an immodest tag line: “America’s Official Medical Knowledge Platform”. The boast sits above an enviable lineup of official partners: The New England Journal of Medicine, Journal of the American Medical Association, National Comprehensive Cancer Network, Cochrane Systematic Reviews. If you were a clinician in need of information these would be the first places to go, [Note: Save, perhaps, for a few journals in the JAMA network, and I write this as someone who has published in and reviewed for JAMA. ] but now there is no need because OpenEvidence will do it for you, for free and — unlike those poor community doctors whose practices can’t afford an NEJM subscription — with full access to all those journals.
Their About page is even more effusive. “Our mission is to help doctors save lives and improve patient care.” Great! It goes on:
This year, more than 100 million Americans will be treated by a clinician using OpenEvidence. As a product, OpenEvidence is an AI copilot for doctors that helps them make high-stakes decisions at the point of care. OpenEvidence is the most widely used medical AI among verified U.S. clinicians. To date, we have supported over 200 million AI-powered clinical consultations from U.S. doctors and other frontline clinicians.
In a remarkably short period of time, OpenEvidence has become the default operating system of medical knowledge in the United States.
Underneath lies the Team, laden with Harvard and MIT affiliations, and a long list of medical advisors ranging from Mayo, Hopkins and Mass General staff to prominent YouTubers.
It was a rather obvious idea, to create a specialized LLM chatbot which restricts its data sources to medical literature only, so when I first saw OpenEvidence, the way it presented itself (partnership with NEJM and JAMA, MIT affiliation) and the price (free for everyone with an NPI) I was pleasantly surprised that these institutions came together for the common good, to create our generation’s PubMed.
Hardy har har.
Scroll further down and under another immodest headline — “Supported by the Best” — sit the logos of Sequoia Capital, Kleiner Perkins, Blackstone, Andreessen Horowitz, Nvidia, Google Ventures and the like. Not listed on the website because there is no “Investor relations” page — that may spook the clinicians! — is the financial history. Earlier this year it raised $250 million in a Series D round at a $12 billion valuation. Just three months before that it raised $200 million at a $6 billion valuation. In total, it has received close to $700 million in funding over its four years of existence.
Yes, OpenEvidence, “the default operating system of medical knowledge in the United States” (their words, emphasis included), is a tech startup zipping through the first phase of enshittification, i.e. attracting users with a high-quality offering. I would argue that even the “high-quality offering” is a bit of a crock, but we’ll come back to that shortly. Let’s, for the purposes of this paragraph, go with the premise that the unique thing that OE provides is the “artificial intelligence” portion. Well, from what I understand the company relies on OpenAI, Anthropic and others for the actual compute, and if that is the case they are one step removed from the absolute carnage whose genesis Ed Zitron and others have been diligently chronicling. The default operating system of American medicine is an earnings miss away from the blue screen of death.
I won’t cry for the billionaires involved. I will, however, mourn the opportunity cost of so many smart physicians and programmers on their medical and technical teams spending their time on point-one-percenter enrichment instead of truly building our generation’s PubMed. It would not even require compute! The true value of OE is the curated collection and unrestricted access to peer-reviewed journals, treatment guidelines, and systematic reviews, supplements and all. Let me google all that — or better yet, look it up on Kagi — and I will not care at all for the LLM-generated veneer glued onto man-made knowledge. But good luck having NEJM, JAMA et al. open their vaults without the VC-backed carrot of (I suspect) God knows how many millions of dollars for access rights combined with the FOMO stick that Anthropic and OpenAI’s PR teams have been so diligently whittling.
Trigger warning for an LLM-sounding phrase: the mounds of AI slop added to OE search results aren’t just wasteful, they are dangerous. Back in the Triassic era when shmucks like yours truly were nursing their middle-finger calluses writing progress notes by hand you knew that every part of that note contained useful knowledge. With the electronic medical record mandate — thanks, Obama — much of it became an unreadable mix of computer-generated charts and copypasta; you had to look at the end of the note to find actual human thought, whether it is in the Assessment and Plan or the Attending Addendum section. Well, I can report from the front lines that much of the time even that one meager paragraph has become a copy/paste job carrying with it that distinct LLM waft.
I am not against using LLMs for progress notes — we have been using human scribes for decades to write up the facts of the doctor-patient encounter. But those are costly and your rural primary care physician certainly won’t have one, so why not delegate that work to AI? The assessment and plan, however, are where you infuse those facts with meaning and then act on them, which is the entire purpose of the physician’s job. Writing is thinking and millions of US medical professionals have decided to delegate the one job they have to AI while keeping all the moral and legal responsibility, reverse-centauring themselves willingly and with eyes wide open.
This may seem like a “the food is horrible and the portions are too small” joke — have I not just written that the whole thing will soon be dead? If you are a physician who values their brain and doesn’t copy off a clanker, why should you care if others start relying on them and then get a rug-pull? Three reasons:
- Expectation-setting: those who copy will need 15 minutes per encounter, then 10, then 5, continuing to ingest slop and regurgitate it over patient notes even as it gets increasingly bad from more and more expensive compute.
- Asbestos exposure: as in, AI is the asbestos we are shoveling into the walls of our society, only the asbestos here is in the form of regurgitated slop we are putting into patient medical records. That, too, will take our descendants some time to dig out, although human life span being what it is it should be less than a whole generation.
- Thinking of the kids: some of my own highest yield learning moments were reading the attending addendum on my note, or the dictation of a particularly skilled specialist’s consult note; will the incoming generations of medical students and residents have the same opportunity?
So if your mission truly was to help doctors save lives and you weren’t a greedy son of a bitch would you not have made a non-profit to achieve that goal? It may not have been as slick as something coming out of Silicon Valley, but it would also not have the risk of blowing up if the financial winds turn and the funding flywheel stops spinning. After all, there have been many attempts to replace the government-funded Medline/PubMed combo, but none of them were enough of an improvement (if an improvement at all) to justify the cost.
🕹️ The fearless style in gaming
Back in the Pleistocene, when I was in grade school and still had dozens of hours per week to spend on PC games, and when computing power and storage space were so precious that even quick-save — forget about autosave as the feature was yet to be even a glimmer in game developers’ eyes — was just not a thing, my obsessive tic, which I repeated every 10 or so minutes, was to stop whichever game I was playing at the time, go to the menu, and save my spot, “just in case”. This is when saving the game carried weight and you could name each time point, all of mine being named just that: “just in case”, or rather, its abbreviated Serbian equivalent “ZSS”. I rarely ever reloaded since these were mostly Sierra and LucasArts-style point-and-click adventures in which it was quite literally impossible to get stuck or make a wrong move, or early RPGs like Lands of Lore which, OK, had its challenging moments, but certainly not enough to warrant 10-min saves.
Fast forward three decades, when my own children have in that regard fallen so far away from the tree that they’ve landed on the gaming Moon [Note: Fallen upward, clearly. There is nothing wrong with this metaphor. Please carry on. ] where the abundance of autosave has destroyed any spot-saving reflex they could have possibly had and with it also any sense of dread for things to come. I was reminded of this stark difference after JTR wrote about his own fears playing Subnautica. There is a link there but it leads to a “Page Not Found” even though the article came through in the RSS feeds and I don’t know if this was a feed-only post or a Micro.blog bug so [Note: Mystery solved, please see @jtr’s comments below. ] I will quote liberally here:
The first time I played, the game had the element of surprise. I remember my first reaper: it came out of nowhere and grabbed my Seamoth like a plaything. I yelped, slammed the Alt+F4 keys, and stomped out of my room as white as the hallway wall I was leaning against, mumbling “oh my god” over and over. Now I know better. I know where they are, I can see them in the distance, and… I’m still scared. But I go ahead anyway. The fear is not pushing me away; it’s teaching me to be prepared. The only thing that’s really scary is fear itself.
See, I never played Subnautica but if I did my playing style would be not much different from JTR’s. Not so for my (nearly) seven-year-old who zips over and across and around reapers and dies and loses some of his supplies and shrugs his just-out-of-toddlerhood shoulders and gets at it again. And if you think that’s because he’s too young to know better you haven’t seen his older sisters play a time challenge level of Astro Bot, a game which severely punishes any hesitation, unnecessary pausing and haphazard jumping.
This is one of many ways in which the kids are better than their parent, and I give much of the credit to the autosave abundance!
Time to get a new-old dictionary
The power of the Internet is that, under an innocuous title such as Sdcv-quick Update, on a mostly technical blog dedicated to Emacs, one can find a most delightful essay by James Somers — from way back in 2014, the days of still-capitalized Internet — about the power of the old Webster’s Dictionary, how it outshines its modern successors, and how, wonder of wonders, you can download and install Webster’s Revised Unabridged Dictionary (1913 + 1828) onto your computing device of choice. [Note: But not Emacs. That’s where quick-sdcv.el comes in! ]
Somers does not quite reach the heights of David Foster Wallace’s Authority and American Usage but then he also takes only about a fifth of the space to make his point. [Note: If these two weren’t enough, “Draft No. 4” by John McPhee will do nicely to meet your dictionary essay needs. ] His whole blog makes for great reading, most of it having been written pre-LLMs. This is important: Somers is a professional writer whose most recent articles in The New Yorker and The Atlantic keep glazing AI. If you think I am exaggerating, here are some of the more recent titles: “The Coming Software Apocalypse”, “The Scientific Paper is Obsolete”, “How Will A.I. Learn Next?”, “A Revolution in How Robots Learn”, “The Case That A.I. Is Thinking”… With this kind of coverage, who needs a marketing department?
Against numbers
In the preamble to his Morose thoughts at the Semiquincentennial, @ReaderJohn notes:
I’m on a social medium (I refuse to abuse the plural “media”) with an astonishing number of people, many of them decades younger than me, who manage, without coming across as idiots (au contraire: I’m struck by how many there make me feel unobservant and thick-skulled about what I do observe), to focus on positive, and personal, and local things. Kudos to its designer, who consciously designed it that way (I’m not sure how, except that one never knows how many people follow him or her, and there are no buttons to simply “like” a post).
That last parenthetical is, I believe, exactly the reason why micro.blog turned out the way it did. My first thought was that it filtered out people who liked to see numbers go up — many of them not of the clearest mind — right at the outset. But that is not all there is to it, probably not even the most important part. The intentional lack of statistics cuts the feedback loops which tend to make some people into complete assholes, and every person into an occasional asshole. [Note: Or at the very least an asshole-appearing online presence, but to the exposed person — meaning you, dear reader — there is no difference. ]
Every popularity contest will reward the extremes. This is why I gave up following the Bear Blog Discovery feed. Random posts from to-me unknown authors just popping into my RSS reader [Note: These days a combination of the Inkwell Android app on my Daylight tablet and my own homebrewed Inkling for Emacs, which is where I’m writing this! ] reminded me too much of Twitter’s algorithms, and even Bubbles — posts from 5,000+ independent blogs, including this one, ranked by timeliness and popularity — favors criticism of AI and tech in general combined with outrage/despondence/resignation towards news of the day/breakage of everyday life/civilizational decline. The only ever Infinite Regress post that ended up on the Bubbles front page fits right in. [Note: A kind reader even uploaded it to Hacker News, where it — thankfully! — received just 4 upvotes and no comments. Small mercies. ] It is, in that sense, no different from Reddit: the medium (of voting) is the message.
Incidentally, these Bubbles and Hacker News and Kagi Small Web and indieblog.page and ooh.directory visitors all leave footprints on this here web site’s Tinylytics dashboard, which has become delightfully uninterpretable owing to the influx in the past few months of what I can only assume are digital ghosts from Hong Kong, Singapore, China and Mexico, in that order. An unexpected benefit of LLM crawlers.
There is something about numbers that makes people’s brains stop working. This is common in medicine, where reflexively treating a lab abnormality without thinking an iota about the patient or even about the ground truth — is this number here “real” or is it a blood collection/lab analysis/data entry error? — is a phase most doctors go through and some never leave. Call it video game brain: confusing the hardcoded information of an RPG stat or an FPS health bar with more malleable values we get from physical measurements.
Well, I know enough about myself not to expect an effortless change in behavior. The effort tank being depleted daily by issues more pressing, I avoid having to interpret these numerical tricksters in any way I can. You see, for feedback to be of any use there has to be effort somewhere and by making leaving it effortless (thumbs up or down? how good was our service from 1 to 10? the text field is optional!) we have made interpreting it seemingly straightforward but in fact harder. Did someone “like” a blog post because reading it was a life-changing experience? Slightly more amusing than the cat photo just below? A toilet seat mistap? Or was it herding?
Now think about all those feedback surveys you started filling out because the first page was a deceptive 1–10 scale, only to abandon them because page 2 had five large fields for free text, all with a mandatory character count. This puts the majority of the cognitive effort on the feedback provider; reading it does take more time than glancing at a number, but the receiver can quickly and effortlessly tell a) whether it is from someone whose opinion they care about and b) what the said opinion is.
So yes this is a long-winded way to nudge you towards writing more emails. Or leaving more comments. Or even starting your own blog. More words, fewer numbers, please. And yes, yes, I am aware how silly asking for more words sounds in these, our Days of Slop. But to go back to the blog post that started all this, and then two links deep to a most brilliant text from Sam Kriss: when everyone from your middle manager bosses to Guardian journalists to prize-winning authors and random tech folk debase themselves with AI, the value of the human-written word does in fact go up.
Friday links, with questions and lists
- Kyla Scanlon: Is AI Going to Destroy our Lives or Not? It has been two months since Scanlon last posted so this is a long update that doesn’t cover much more than Ed Zitron recently did, albeit in a — to me, at least — more readable style. The last section, titled “What Do You Do About It”, may be useful to many.
- Reese Richardson: How much of Thermo Fisher’s antibody data has been manipulated? A lot of it, it looks like. I wonder how many poor PhD students and postdocs wasted weeks and months of effort because the antibodies Thermo Fisher Scientific sold weren’t as specific or sensitive as they claimed. Surely as a publicly traded company TFS should have known better. Must have been a rogue employee, or a hundred. Caveat emptor!
- John Siracusa: EV Stupidity Checklist. Twelve physical items that have gotten worse since more cars became electric, with recommendations on how to go back to saner designs. You know, ones that include a rear window.
- Brian Phillips for The Ringer: The 40 Most Rage-Inducing Problems in Tech. [Note: ᔥAlan Jacobs ] A broader list and more tongue-in-cheek, but items 37 through 39 cut deep.
Have a great weekend!
If it looks like a press release and reads like a press release, why is it being sold as a government report?
Doc in a Box from Alex Tabarrok links to an official state government document, from the Utah Department of Commerce. The document is titled “Key Statistics on the Doctronic Pilot Program” but reads more like a bulleted press release, full of percentages without a denominator, begging for a flow chart. Press releases are like that because you typically won’t add images — although this one randomly selected from today does indeed include one along with the full abstract submitted to the ASCO annual meeting, and good for them — but more importantly because you want to pick the best possible picture-perfect view of your shiny spotless data elephant without also acknowledging that it has a rear end and a bunch of flies buzzing around, and smells a bit rank. Does your elephant not have an ass, Utah? Or did you just copy/paste what Doctronic — a startup whose wonky web page doesn’t even work — sent you?
Doctronic.
We have hit a technical snag. Go to Homepage to hit it again.
So how many patients could they have evaluated? This article in JAMA Forum says that “[p]hysicians hired by Doctronic will review the AI’s output for the first 250 patients before the system takes any action and will review the next 1000 patients retrospectively after the AI agent begins acting autonomously.” Are the key statistics from the first 250? The very first bullet point in the press release summary document says that the program is still in Phase One and that “the number of patients so far is limited”, so I guess not. Is it 100 at least? Surely they wouldn’t use a percentage as high as 97 if there were fewer than that involved. Except that as low as 30 will give you a percent roundable to 97. So, 30 to 249?
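For the skeptical, the denominator question can be checked by brute force. What follows is a minimal sketch, assuming the headline figure is a whole-number percentage rounded half up from a simple count; the Utah document does not say how it was rounded, and `denominators_for` is a made-up helper, not anything from Doctronic or the state.

```python
def denominators_for(percent: int, max_n: int = 249) -> list[int]:
    """Every sample size n <= max_n for which some count k out of n
    rounds (half up) to the whole-number `percent`."""
    hits = []
    for n in range(1, max_n + 1):
        for k in range(n + 1):
            # Integer form of: percent - 0.5 <= 100 * k / n < percent + 0.5
            if (2 * percent - 1) * n <= 200 * k < (2 * percent + 1) * n:
                hits.append(n)
                break
    return hits


sizes = denominators_for(97)
print(sizes[0])  # smallest group that can produce a reported "97%"
```

Under those assumptions the floor comes out at 29 (28 of 29 is 96.6%), a hair below the 30 mentioned above, and no group of 28 or fewer can produce a 97 at all. The argument is unchanged either way: a bare percentage is compatible with a very small sample.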
Why am I being so pedantic? Well, these techniques are par for the course in the biotech world, but coming from a state agency they make me think there is a bit too much enthusiasm for the program inside the government. Compare and contrast to the shellacking LLMs got in this report from the Office of the Auditor General of Ontario, which reviewed AI Scribe functionality from 20 vendors. Their report even has absolute numbers in it! These state government officials should realize that they are prime targets for flim flam merchants and should behave accordingly.
Note that I am not against the idea in general. The project’s goal is in fact quite noble: there is no reason why plain ol’ machine learning shouldn’t be able to suss out the majority of refill requests for chronic medications and flag patients who haven’t had their bloodwork or diabetic foot assessments done, or who’ve had abnormal office blood pressure readings at prior visits. Having that easy refill option available would mean a patient coming in for an in-person visit for what should be “only” prescription refills is even more of a signal that something else may be amiss, even if the patient can’t or won’t verbalize it. So yes, LLM refills, bring ’em on. Doctronic’s end-goal of actual autonomous Shoggoths putting on white coats and replacing MDs, PAs, NPs and other credentialed humans… not so much.
Tuesday links bonanza
- Kevin Kelly: Your Most Improbable Life.
Your life’s goal should be to become the most improbable person you can be. Your path, your character, your life, should be the most unlikely, the most unexpected, the least predictable version you can make. Improbable lives have fewer competitors, more unique rewards, and are harder to replace with AIs, since AIs run on the predictable. This is true whether you favor traditional humanist directions or work on a frontier.
This is a nice preamble to a bit of personal news I can finally share: I will soon be going back [Note: It is a qualified “back”, as I have never actually practiced medicine full time, being either in training, doing clinical research as my main job, or being out of clinic altogether save for a few hours a week doing charity work. ] to the practice of clinical medicine. This week is in fact the last in my current position, which had been a magnificent experience but was going, as the careful reader of this blog would have already noted, in a direction not entirely suited to my preferred lifestyle and more importantly — let’s not sugarcoat it — values and beliefs. Onwards and upwards!
- Derek Robertson: 5 questions for Meredith Whittaker. [Note: ᔥVoline on Mastodon ]
Whittaker, who is the president of the Signal Foundation (as in the app), had this to say about venture capital back in 2023:
Venture capital looks at valuations and growth, not necessarily at profit or revenue. So you don’t actually have to invest in technology that works, or that even makes a profit, you simply have to have a narrative that is compelling enough to float those valuations. So you see this repetitive and exhausting hype cycle as a feature in this industry. A couple of years ago, you would have been asking me about the metaverse, then last year, you would have asked me about Web3 and crypto, and for each of these inflection points there’s an Andreessen Horowitz manifesto.
It’s not simply that one piece of technology is overhyped, it’s that hype is a necessary ingredient of the current business ecosystem of the tech industry. We should examine how often the financial incentive for hype is rewarded without any real social returns, without any meaningful progress in technology, without these tools and services and worlds ever actually manifesting. That’s key to understanding the growing chasm between the narrative of techno-optimists and the reality of our tech-encumbered world.
Emphasis is mine, as it could be transposed word-for-word into the current world of drug development. Consider it a more polite rewording of prof. Taleb’s take.
- Venkatesh Rao: Commodity Intelligence.
Commodified knowledge is “general knowledge” in the sense tested by trivia/quiz contests. In grade school, we actually had a subject on the curriculum called “GK” and kids good at it (I was one of them) got put on quiz teams to represent their class or school. General intelligence of the sort we actually have today is simply AIs trained on general (ie commodified) knowledge.
But the theological motte-and-bailey move that conflates it with some totalizing-universal divine-omniscience idea of “Artificial General Intelligence” traps a great many of even the smartest people. A category error motivated by theological yearnings, validated by second-order Labatutian psychoses, sustained by epistemic bubbles, and encouraged by sketchy business roadmaps that need a story to justify trillion-dollar investments.
This is a charitable way of justifying the AI billionaire panhandlers’ selling of large language models as AGI, even putting the term in official titles. Less charitably, they all know what Yann LeCun has been saying for years: LLMs will never reach human level of intelligence (“ChatGPT, make me a sandwich”). Whether LeCun’s own pursuits are wise is a different matter.
Separately, Rao gives some good book tips and Benjamin Labatut’s When We Cease to Understand the World is now on the Pile.
- Andrew Gelman: Don’t cite sources you haven’t read, and don’t trust when people claim to be reporting something from the literature.
No quotes because, true to form, everything salient is already in the title. Natural continuation of the debate started last week (see the last link), although apparently written before the new arXiv policy for a 1-year ban for hallucinated references.
- Kieran Healy: Zero Sum Problems.
Healy wrote a book about data visualization so I feel somewhat foolish in writing this, but I do not find Apple Sports’ presentation the least bit confusing: the numbers are absolute, the bars show the percentage of the total. If the goal is to have more of each (assists, rebounds, steals, etc.) the bigger bar shows the opposing team’s dominance. It’s fine. Healy’s proposed solutions are all notably uglier and demote low-occurrence events like blocks and steals even though they may be crucial in a game. Shows how little both Healy and Gruber — on whose post Healy riffs — know about the game of basketball.
- Sarah Kliff, Margot Sanger-Katz, Erin Schaff and Asmaa Elkeurti for The NYT: Short Naps, Long Hours: How Autism Clinics Squeeze Medicaid Dollars Out of Preschoolers.
At Compleat Kidz, a fast-growing chain of autism clinics based in North Carolina, the policy is firm: Naps cannot be longer than seven minutes before children are awakened to resume therapy. The company says this is necessary to prevent fraud since clinics can be paid only when children are awake and getting services. But it also allows the clinic to bill insurers or Medicaid for more hours.
Yes, you have read that correctly. Waking up a child after a 7-minute nap to perform “therapy” — as if anything meaningful can be accomplished in that hypnagogic state — is both cruel and unusual. But not a punishment! It is merely a way to avoid fraud while optimizing revenue under the watchful eye of private equity:
Private equity firms have acquired at least 500 clinics over the past decade. “There’s just huge opportunities to grow these businesses and help increase access to care,” said Jon Krieger, a managing partner at Calex, a financial firm that assists with autism clinic mergers and acquisitions. He estimates the market could grow to $90 billion.
Mr. Market is a bad doctor, an even worse vet and, it seems, a most diabolical nanny.
Apple decoupling update: replacing DEVONthink Pro
DEVONthink Pro — henceforth DTP — is one of those Swiss Army Knife applications that is different things to different people. [Note: At one point or another DTP has also been my app for journaling, structured note-taking (even bought a book about it), managing journal references, reading RSS feeds, archiving podcast episodes, batch renaming and automated file wrangling à la Hazel. Phew. ] After a dozen or so years of dabbling, I have pared down its use cases to a single one: managing documents both electronic (office files and emails, mostly) and physical (thanks to the now discontinued but still phenomenal ScanSnap iX500). The only reason I used DTP and not say Finder was its “intelligent” file sorting, or rather sorting recommendations paired with fast search. So, that was the only thing I had to replicate to get my DTP replacement on Linux now that I am making the slow jump.
And with quite a bit of help from Google Gemini — again — I think I have this one licked:
- Recoll provides full-text search, with consult-recoll for integration with Emacs (yes, it’s Emacs all the way down).
- Dirvish gives a nice 3-pane view of the file structure.
- A custom Python script [Note: This is a completely Gemini-generated piece of code. Use at your own risk or, better yet, ask your favorite LLM to make one for you. ] which a) trains itself on the folder and file structure of my “Documents” folder to b) predict the correct classification for a single file (via Emacs) or a group of files (via command line). Yes, it’s the old-fashioned AI, a.k.a. Machine Learning, which now runs as a once-hourly cron job on my laptop.
- A custom Emacs function that binds the “c” key while in Dirvish — usually while in my “Inbox” folder — to classify a file and move it by TAB-completion; in case I am not happy with the few folders it recommended I can start typing and it will fuzzy-find what I wanted.
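The self-training step in the third bullet can be as plain as a naive Bayes classifier over file-name tokens. The sketch below is not the Gemini-generated script, which is not reproduced here; `FolderGuesser`, `tokens` and the toy folder names are all hypothetical, and a real version would presumably walk the “Documents” tree and perhaps look at file contents too.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(filename: str) -> list[str]:
    """Lowercase alphanumeric chunks: 'Tax_2023.pdf' -> ['tax', '2023', 'pdf']."""
    return re.findall(r"[a-z0-9]+", filename.lower())


class FolderGuesser:
    """Multinomial naive Bayes with add-one smoothing, one class per folder."""

    def __init__(self) -> None:
        self.folder_counts = Counter()            # files seen per folder
        self.token_counts = defaultdict(Counter)  # token counts per folder
        self.vocab = set()                        # every token seen in training

    def train(self, examples) -> None:
        """examples: iterable of (filename, folder) pairs already filed by hand."""
        for filename, folder in examples:
            self.folder_counts[folder] += 1
            for tok in tokens(filename):
                self.token_counts[folder][tok] += 1
                self.vocab.add(tok)

    def suggest(self, filename: str, top: int = 3) -> list[str]:
        """Return up to `top` folders, most likely first."""
        total_files = sum(self.folder_counts.values())
        scores = {}
        for folder, n_files in self.folder_counts.items():
            counts = self.token_counts[folder]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_files / total_files)  # prior: how full the folder is
            for tok in tokens(filename):
                score += math.log((counts[tok] + 1) / denom)
            scores[folder] = score
        return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy training set standing in for an existing, hand-sorted folder tree.
guesser = FolderGuesser()
guesser.train([
    ("2023_tax_return.pdf", "Taxes"),
    ("tax_receipt_charity.pdf", "Taxes"),
    ("car_insurance_policy.pdf", "Insurance"),
    ("home_insurance_renewal.pdf", "Insurance"),
])
print(guesser.suggest("2024_tax_estimate.pdf", top=1))  # ['Taxes']
```

The ranked list is the kind of thing a TAB-completion prompt could offer first, with fuzzy-finding as the fallback when none of the guesses fit.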
Rube Goldberg-y? Yes. Does it work? By golly, it does. For 95% of things I need it for it works even better and faster than DTP, which required mouse-dragging to move a file whenever I wasn’t happy with the recommended classification. The only wrinkle left to address is naming conflicts: DTP didn’t care if two completely different files had the same name as it had its own way to track them. Trying to copy a file to a folder that already has one with the same name currently doesn’t work, but that should be a quick weekend fix.
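One possible shape for that weekend fix, sketched in Python to match the classifier script rather than the Emacs side: pick the first free “name (n)” variant before moving. The helper names and the numbering scheme are assumptions, and the check-then-move is not atomic, which is probably fine for a single-user laptop.

```python
import shutil
import tempfile
from pathlib import Path


def free_destination(folder: Path, name: str) -> Path:
    """Return folder/name, or the first 'stem (n).ext' variant not yet taken."""
    candidate = folder / name
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = folder / f"{stem} ({counter}){suffix}"
        counter += 1
    return candidate


def move_without_clobbering(src: Path, folder: Path) -> Path:
    """Move src into folder, renaming instead of overwriting on a name clash."""
    target = free_destination(folder, src.name)
    shutil.move(str(src), str(target))
    return target


# Demo in a throwaway directory: two different files that share a name.
with tempfile.TemporaryDirectory() as tmp:
    inbox, docs = Path(tmp, "Inbox"), Path(tmp, "Docs")
    inbox.mkdir()
    docs.mkdir()
    (docs / "scan.pdf").write_text("already filed")
    (inbox / "scan.pdf").write_text("new arrival")
    moved_name = move_without_clobbering(inbox / "scan.pdf", docs).name
    kept_old = (docs / "scan.pdf").read_text()

print(moved_name)  # scan (1).pdf
```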
🕹️ Battlestar Galactica: Scattered Hopes seems to be a game I would enjoy, and this “review” in Aftermath is laudatory enough, but what on Earth made its author call it a “rougelike”? Tower defense, yes. Roguelike? Nuh-uh. Gaming journalism is going from bad to worse.
First they came for the programmers… Then they came for the doctors. But not really.
Back in September 2023 I noted that the biggest hurdle for AI completely replacing physicians is the physicality of the job. Sure, LLMs are good at giving differential diagnoses and faking empathy once somebody’s problem has been reduced to text, but the art of medicine is in the act of seeing, feeling, smelling, etc. [Note: Although increasingly less so, as doctors and trainees are becoming experts at treating patients in the chart and not those in front of them, making themselves the perfect foils for replacement; cue photo of the old man yelling at clouds. ] If clankers have any hope of replacing humans, they’d better get some senses.
At first glance, a recent Nature Medicine paper aimed to do just that by introducing what the group of authors — all of them Google employees based in the UK and California — call “multimodal reasoning” but is in fact the chatbot being able to interpret images, ECGs and lab reports in addition to the pre-digested clinical pearl. The topline result, one that the journal itself felt obligated to headline, was that “AI had superior performance compared with physicians for almost every metric (29 of 32 axes)”. But at what?
You would think that the question would have been easy to answer, this being a peer-reviewed paper and all, but no. In fact, I am still not completely certain what interactions were performed and whether they completely match what was reported. What is certain is that a set of primary care physicians and patient-actors from Canada and India — countries different from the authors’ own, and let’s wonder conspiratorially why that may be the case — interacted via an instant messaging-like service. This is the first oddity: even remote health visits are performed using video calls, and yes you may occasionally get a text through the EMR or if you are a VIP/boutique physician maybe your phone, but that is far from the norm.
The primary report is on what happened when the patients uploaded the skin photos, ECGs, lab results, etc. and then asked the physician or LLM on the other end questions about it. Pretty standard fare for a human-to-LLM interaction, but not exactly natural for a doctor-patient relationship which usually starts with questions being asked of the patient. This is the second way in which the setup was made to fit the computer and not the human.
But then the last section of the paper is about what happens when there is, in fact, a back-and-forth by way of taking a history. The extended figures — “extended” here meaning not worthy enough of being included in the main paper — say it improves the performance of the LLM. They do not say how it affected the human performance, or how the patient-actors rated humans versus LLMs in history-taking. I would call that strike three.
To the journal’s credit, they did not allow Google to get away with it completely. “To evaluate the performance of our finalized system, we conducted a randomized, blinded human evaluation that emulates an objective structured clinical examination”, says the final paragraph of the introduction, only to end with:
We note, however, that our study is not a randomized clinical trial with prespecified endpoints and preregistered statistical analysis. Rather, it is an exploratory study investigating the properties of multimodal diagnostic dialogue.
Peer review is at least good for something, even if it does result in self-contradiction.
Meanwhile, in the world without motivated reasoning, more objective assessments of the usefulness of AI in medicine show that it is in fact still quite bad. This does not prevent the massively funded hordes of AI researchers from flooding the field with sloppy work, creating the impression that the rise of the machines is imminent. Comply or relegate yourself to the permanent underclass, serf MD. But of course, relegation will only be possible to the extent that the medical profession — or any other, really — has already debased itself and abandoned its core professional principles in the service of electronic ease.