Why not use machine learning to rank residency applicants?
I just finished attending a 1-hour career panel for UMBC undergrads thinking about medical school, and the one thing anyone interested in practicing medicine in America should know is that you really, really, really need to know how to answer multiple choice questions. It doesn’t matter how smart, knowledgeable, or hard-working you are: if you don’t have the skills needed to pick the one correct answer out of the four to six usually given, be ready to take a hit on how, where, and whether you can practice medicine in the US at all.
To be clear, this is a condemnation of the current system! Yes, there are always tradeoffs: the oral exams so prevalent in my own medical school in Serbia weigh against the socially awkward and those who second-guess themselves. But MCQs are so pervasive in every aspect of evaluating doctors-to-be (and practicing physicians!) that you have to wonder about all the ways, seen and unseen, in which Goodhart’s law is affecting healthcare.
What would the ideal evaluation of medical students look like? It wouldn’t rely on a single method, for one. Or, to be more precise, it wouldn’t make a single method the only one that mattered. Whether it’s the MCAT to get into medical school, USMLE to get into residency and fellowship, or board exams to get and maintain certification, it is always the same method for the majority of (sub)specialties. Different organizations, at different levels of medical education, zeroing in on the same method could indeed mean that the method is really good — see: carcinisation. (To save you a click: it is “a form of convergent evolution in which non-crab crustaceans evolve a crab-like body plan”, as per Wikipedia. In other words, the crab-like body plan is so good that it evolved at least five different times.) But then, if it is so great to be shaped like a crab, where are our crab-like overlords?
Being a crab is a great solution for a beach-dwelling predatory crustacean with no great ambitions, and MCQs are a great solution to quickly triage the abysmal from everyone else when you are pressed for resources and time. But, both could also be signs of giving up on life, like how moving to your parents' basement is the convergence point for many different kinds of failed ambition.
Behind the overuse of MCQs is the urge to rank. Which, mind you, is not why tests like the USMLE were created. They were, much like IQ tests, meant to triage the low-performing students from the others. But the test spits out a number, and since a higher number is by definition, well, higher than the lower ones, the ranking began, and with it the Goodhartization of medical education. The ranking became especially useful as every step of the process became more competitive and the programs started getting drowned in thousands of applications, all with different kinds of transcripts, personal statements, and letters of recommendation. The golden thread tying them all together, the one component to rule them all, was the number they all shared — the USMLE score.
But then the programs started competing for the same limited pool of good test-takers, neglecting the particulars of why a lower-scoring candidate may actually be a better match for their program. A bad experience all around, unless you are good at taking tests, in which case good for you, but also look up “bear favor”. On the other hand, there is all this other information — words, not numbers — that gets misused or ignored. If only there were a way for medical schools and residency programs to analyze the applications of the students/residents that they found successful, by whatever metric, and make a tailor-made prediction engine.
Which is kind of like what machine learning is, and it was such a logical thing to do that of course people tried it, several times, with mixed success. It was encouraging to see that two of these three papers were published in Academic Medicine, which is the AAMC’s own journal. One can only hope that this will lead to a multitude of different methods of analysis, a thousand flowers blooming, etc. The alternative — one algorithm to rule them all — could be as bad as the USMLE.
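For the curious, here is a minimal sketch of what such a tailor-made prediction engine could look like (not what any of the papers above actually did, just the general shape of the idea). The file name, feature columns, and the “success” label are all hypothetical placeholders that a program would have to define for itself.

```python
# Minimal sketch (not the published methods): train a model on a program's own
# past applicants, labeled by whatever "success" metric the program chooses.
# The file name and column names below are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("past_applicants.csv")  # hypothetical file of past cohorts
features = ["step2_score", "clerkship_honors", "publications", "research_years"]
X, y = df[features], df["successful_resident"]  # program-defined success label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Score next season's applications the same way; the output is a probability
# tailored to this program's history, not a single national ranking.
```

The appeal is precisely that each program trains on its own history and its own definition of success, which is the thousand-flowers alternative to one shared number.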
The caveat is that Americans are litigious. Algorithmic hiring has already raised some alarm, so I can readily imagine the first lawsuit from an unmatched but well-moneyed candidate complaining that no human ever laid eyes on their application. But if that’s the worst thing that could happen, it’s well worth trying.
No one is hiding the miracle cures
So, who wants to dismantle the FDA, you ask? Some patient advocacy groups, among others, aided by a few senators:
We need the FDA to be more insulated from these forces. Instead, every few years, legislators offer bills that amount to death by a thousand cuts for the agency. The latest is the Promising Pathways Act, which offers “conditional approval” of new drugs, without even the need for the preliminary evidence that accelerated approval requires (i.e., some indication that biomarkers associated with real outcomes like disease progression or survival are moving in the right direction in early drug studies).
…
This bill is being pushed by powerful patient groups and has the support of Democratic senators like Kirsten Gillibrand and Raphael Warnock, who should know better.
The bill would codify using “real-world data” and unvalidated surrogate endpoints for something called “provisional approval”, a level below the already tenuous accelerated approval.
I can see how it may appeal to patients: you may get a promising new drug for your life-threatening, debilitating disease sooner via this pathway. On the other hand, there are already mechanisms in place that enable access to these: a clinical trial, for one. Or expanded access (a.k.a. “compassionate use”) for those who may not be eligible for a trial.
So how would “provisional approval” help? If anything, wouldn’t it transfer the risks and — importantly — costs of drug development from the drug manufacturer/sponsor/study investigator to the patient?
Ultimately, the reason why there aren’t many cures for rare, terminal diseases is not because the big bad FDA is keeping already developed drugs away from patients, but rather because they are devilishly difficult to develop at our current level of technology. Wouldn’t it then make more sense to work on advancing the technology that would lead to those new cures? (The careful reader will note that the opposite is being done, and I write this as no great fan of AI.) I worry that the Promising Pathways Act would solve a problem that doesn’t exist while adding to the already skyrocketing costs of American health care. But that could be just me.
(↬Derek Lowe)
Do you know about a horse named Jim? The one whose tetanus-contaminated serum was used to make diphtheria antitoxin that killed kids:
These failures in oversight led to the distribution of antitoxin that caused the death of 12 more children, which were highly publicized by newspaper magnate Joseph Pulitzer as part of his general opposition to the practice of vaccination.
There is a straight line from Jim to the creation of the US FDA, in case you want to remove that particular Chesterton’s fence.
Seth Godin on the amateur presenter:
If you’re called on to give a talk or presentation, the biggest trap to avoid is the most common: Decide that you need to be just like a professional presenter, but not quite as good. Being a 7 out of 10 at professional presenting is a mistake. Better to stay home and send a memo.
This is exactly what happened last month at that medical conference. Colleagues, please stop.
Term confusion alert 2: outcome versus endpoint
Our clinical trials course at UMBC is well under way, and we are getting some terrific questions from students. Here is one!
Q: Are outcomes surrogate endpoints or is there a distinction between the two?
The terms “outcome” and “endpoint” are not strictly defined and some people use them interchangeably. However:
- Outcomes are broader, and include any change in health that is considered important enough to measure in a patient (such as “overall survival” — the amount of time between enrolling onto the trial and death, or “quality of life” — a certain score on a specified scale that the patient fills out or the doctor administers).
- Endpoints are more specific than outcomes, consider the whole study population instead of individual patients, and need a precisely defined method of measurement and prespecified time points at which they are measured (e.g. “median overall survival”, “3-year overall survival rate”, and “5-year overall survival rate” are three different endpoints that are different ways of aggregating and evaluating the same individual patient outcome — overall survival; see the sketch below).
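To make the outcome-to-endpoint step concrete, here is a toy sketch with made-up survival times. Real analyses have to handle censoring (usually with Kaplan-Meier methods), which is ignored here for clarity.

```python
# Sketch: one patient-level outcome (overall survival, in months from
# enrollment to death) aggregated into several distinct study-level endpoints.
# Toy, uncensored numbers for illustration only.
from statistics import median

overall_survival_months = [4, 9, 14, 22, 30, 38, 41, 55, 60, 72]  # hypothetical
n = len(overall_survival_months)

median_os = median(overall_survival_months)
os_rate_3yr = sum(m >= 36 for m in overall_survival_months) / n
os_rate_5yr = sum(m >= 60 for m in overall_survival_months) / n

print(f"Median OS: {median_os} months")      # endpoint 1
print(f"3-year OS rate: {os_rate_3yr:.0%}")  # endpoint 2
print(f"5-year OS rate: {os_rate_5yr:.0%}")  # endpoint 3
```

Same ten patient outcomes, three different endpoints, each needing its own prespecified definition in the protocol.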
It reminds me of the confusion between efficacy and effectiveness, only it’s worse: there is no agreed-upon text that describes the distinction, so it is a real terminological free-for-all. Indeed, what I wrote above may end up not being true — caveat lector! As always, it is best to ask people to clarify what they meant when they said this or that. Regardless, if someone tells you that “overall survival” (or, worse yet, “survival”) was the primary endpoint, that clearly can’t be the case. Endpoints need to be more specific than that.
Surrogate outcomes and surrogate endpoints are those which are stand-ins for what we actually care about. (Here is a good video on surrogate endpoints in oncology.) E.g., when we give chemotherapy to someone with cancer, we do it so that they would live longer and/or better. However, it is quicker and easier to measure whether the tumor shrinks after chemotherapy (i.e. “responds” to treatment), and we believe that the tumor shrinking will lead to the patient living longer or better (which may not necessarily be the case!), so we use response as a surrogate outcome for survival and quality of life (by how much did the tumor shrink? was it a complete or a partial response according to pre-specified criteria?). Study-level surrogate endpoints would be the overall response rate, partial response rate, complete response rate, etc.
We have created so much confusion here that it is a small miracle we can communicate amongst ourselves at all.
Michael Bonner on MAID, Canada’s euthanasia program:
In Canada, patients can wait years for medical treatment in the country’s overburdened and underfunded health-care system. The baby boomer population, as it ages, will only increase the strain on the medical system, the welfare state, and the Canada Pension Plan. A 2017 study published in the Canadian Medical Association Journal suggested euthanasia could save the country between $34.7 million and $138.8 million annually—a “substantial savings,” the authors said. Is this why the government is eager to expand MAID?
Kurt Vonnegut, 2 B R 0 2 B:
Got a problem? Just pick up the phone. It solved them all — and all the same way!
November lectures of note
Next Wednesday looks busy, but there is a Thanksgiving-sized gap in the calendar.
- Development of a CMV Vaccine: From Bench to Congress by Sallie Permar, MD PhD; Wednesday November 8 at 12pm EST
- Unraveling Regulation and Unexpected Role of Interleukin 17 by Huyn Park; Wednesday November 8 at 4pm EST
- Nanomedicine: Which Cancers to Treat by Gregory Lanza, MD PhD is a 3-hour workshop that is free to the public but requires pre-registration; it is held today at 2pm EST, and if you can’t make it there are sessions on Wednesday, November 8 and Tuesday, November 14, also at 2pm EST
- Exploratory and Inferential Spatial Statistical Methods: Tools to Understand the Geography of Health Across the U.S. by Loni Philip Tabb, PhD; Tuesday, November 18 at 1pm EST, open to the public but requires registration
I can only nod my head in agreement with Jake Seliger’s description of clinical trials in America:
The clinical trial system could be a lot worse, but, in its current state, it’s far from optimal, to the point that I’d characterize it as “pretty decently broken.”
Well, maybe add one thing: it is a small miracle that the system works at all, considering that it is built on top of an even more dysfunctional health care system. (ᔥTyler Cowen)
Term confusion alert: efficacy versus effectiveness
We like to do things in medicine, and medicine’s big contribution to science was figuring out how best to answer the question of whether the things we do actually work. But of course things aren’t so simple, because “Does it work?” is actually two questions: “Can it work?”, i.e. will an intervention do more good than harm under ideal circumstances, and “Does it work in practice?”, i.e. will an intervention do more good than harm in usual practice.
We also like to complicate things in medicine, so the person who first delineated this distinction, Archie Cochrane of the eponymous collaboration, named them efficacy and effectiveness, respectively — just similar enough to cause confusion. He also added efficiency for good measure (“Is it worth it?”). Fifty years later, people are still grappling with these concepts and talking over each other’s heads when discussing value in health care. Which is to say, it’s best not to use the same prefix for overlapping terms, but if you had to, “eff” is most appropriate.
The most recent example is masks. The Cochrane Collaboration’s review said they didn’t “work” for preventing respiratory infections. (The paper caused an uproar and the language has since been toned down, but that was the gist.) Now, knowing what Cochrane is all about, the first question to ask is: which sense of “work” did the authors intend? This particular group is all about effectiveness (working in “the real world”), not efficacy (working under ideal conditions). This caused some major cognitive dissonance among the covid-19 commenters. Vox had the typical sentiment:
Furthermore, neither of those studies [included in the meta-analysis] looked directly at whether people wear masks, but instead at whether people were encouraged or told to wear masks by researchers. If telling people to wear masks doesn’t lead to reduced infections, it may be because masks just don’t work, or it could be because people don’t wear masks when they’re told, or aren’t wearing them correctly.
There’s no clear way to distinguish between those possibilities without more original research — which is not what a meta-analysis of existing work can do.
But this is the difference between ideal conditions (you force a person to wear a mask and monitor their compliance) and typical conditions (you tell the person to wear a mask and keep your fingers crossed), and Cochrane is interested in the latter, which is the one more important to policy-makers. (Though of course, the chasm between ideal and typical circumstances varies by country, and some countries can do more than others to bring the circumstances closer to ideal, by more or less savory means.)
This is an important point: policy makers make broad choices at a population level, and thus (do? should?) care more about effectiveness. Clinicians, on the other hand, make individual recommendations for which they generally need to know both things: how would this work under ideal conditions, how does it work typically, and — if there is a large discrepancy — what should I do to make the conditions for this particular person closer to the ideal? We could discuss bringing circumstances closer to ideal at the population level as well, but you can ask the people of Australia how well that went.
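A back-of-the-envelope sketch of that gap, with entirely made-up numbers: suppose an intervention cuts infection risk by 60% when it is actually used (efficacy), but only half of the people told to use it actually do. What a pragmatic trial or a policy-maker then sees is roughly half the effect (effectiveness).

```python
# Sketch of the efficacy/effectiveness gap, all numbers hypothetical:
# an intervention that cuts infection risk by 60% when actually used,
# advised to a population where only half of people follow the advice.
baseline_risk = 0.10          # infection risk with no intervention
risk_reduction_if_used = 0.60 # effect under ideal conditions
adherence = 0.50              # fraction who actually follow the advice

risk_if_used = baseline_risk * (1 - risk_reduction_if_used)
risk_in_advised_group = adherence * risk_if_used + (1 - adherence) * baseline_risk

efficacy = 1 - risk_if_used / baseline_risk
effectiveness = 1 - risk_in_advised_group / baseline_risk

print(f"Efficacy (ideal conditions): {efficacy:.0%} risk reduction")         # 60%
print(f"Effectiveness (typical conditions): {effectiveness:.0%} reduction")  # 30%
```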
The great colonoscopy debate is another good example of efficacy versus effectiveness. There is no doubt that a perfectly performed colonoscopy at regular intervals will bring the probability of developing colon cancer very close to zero, i.e. the efficacy is as good as you can hope for from a medical intervention. But: perfection is contingent on anatomy, behavior, and technique; “regular intervals” can be anything from every 3 months to every 10 years; and there are risks of both the endoscopy and the sedation involved, or major discomfort without the sedation. And thus you get large randomized controlled trials with “negative” results that don’t end up changing practice. (Though they do provide plenty of fodder for podcasts and blogs, so, thanks?)
So with all that in mind, it was… amusing? to see some top-notch mathematicians — including Nassim Taleb! — trying to extrapolate efficacy data out of a data set created to analyze effectiveness. (The link is to the preprint. Yaneer Bar-Yam, the paper’s first author, has a good X thread as an overview.) To be clear, this is a worthwhile contribution and I’ll read the paper in depth to see whether its methods can be applied to cases where effectiveness data is easier to come by than efficacy data (i.e. most of actual clinical practice). But it is also an example of term confusion, where efficacy and effectiveness are for the most part used interchangeably, except in the legend for Table 1, which says, and I quote:
The two by two table provides the incidence rates of interest in a study of the efficacy (trial) or effectiveness (observational study) of an intervention to reduce risk of infection from an airborne pathogen.
Which seems to imply that you measure efficacy exclusively in trials and effectiveness in observational studies, but that is just not the case (the colonoscopy RCT being the perfect example of an effectiveness trial). And of course it is a spectrum: efficacy can only be perfectly measured under impossible-to-achieve conditions of 100% adherence and a sample that is completely representative of the population in question, so any clinical trial is “tainted” with effectiveness, though the further down you are on the Phase 1 to Phase 4 rollercoaster, the closer you are to measuring pure effectiveness.
I wonder how much less ill will there would be if the authors on either side realized they were talking about different things. The same amount, most likely, but one could hope…
Update: Not two seconds after I posted this, a JAMA Network Open article titled “Masks During Pandemics Caused by Respiratory Pathogens—Evidence and Implications for Action” popped into my timeline and wouldn’t you know it, it also uses efficacy and effectiveness interchangeably, as a matter of style. This is in a peer-reviewed publication, mind you. They shouldn’t have bothered.
The sad state of (Serbian) science news
If you thought the state of American media was bad — and justifiably so — I can assure you that most of the world has it much worse. Every so often I get sent a link to a Serbian news site writing about cancer research, and it is always a disaster. Here is the most recent one, short enough to be quoted fully here (translation courtesy of Google):
A German company presented an anti-cancer drug: The tumor stopped growing in all patients
The German company Biontek (BioNTech) is currently raising hopes with its cancer vaccine CARVac.
The first research results show that tumors can be stopped from growing, and sometimes even reduced. The first successes occurred after two out of four vaccination doses.
Most study participants (59 percent) had their tumors shrink by at least 30 percent. In addition, the tumor stopped growing in almost all patients (95 percent) after vaccination. Like the covid 19 vaccine, the vaccine is based on mRNA technology.
This means that a certain protein is taken into the cell, allowing the body to repair it itself.
The new vaccine was developed by a team led by Biontek founder Ugur Sahin (58) and founder Ozlem Turecci (56).
So far, 44 patients have received it in four doses. Success was particularly high after two doses, after four doses the tumors were reduced by at least 30 percent in just under half (45 percent), and the cancer was stabilized in 74 percent of all patients.
Let me list the ways in which this is a terrible news story:
No source
Where did the data come from? Was it a paper, an abstract, a press release, or a leak? A 2-second journey to DuckDuckGo shows that they were, in fact, presented at the 2023 ESMO Congress, which is the annual gathering of the European Society for Medical Oncology. The Serbian website does mention a Bosnian article as a “source” for their copy/paste job, but that article also doesn’t list where the data came from.
Wrong data
“The first research results…”, the article begins. Being the first is big news. But these aren’t the first results. Some were presented last year at the same congress, and even that was a follow-up of data presented earlier.
Incomplete data
Vaccines make the news, so that’s what they highlight, but the trial is actually of a cell therapy with and without the vaccine. The 44 patients they mention are the ones who got the cell therapy with and without the vaccine, and there is no breakdown of how many of them got the actual vaccine. With cancer vaccines’ abysmal past record (no, they are not now being “tried in cancer” after the success in Covid-19; they were, in fact, developed for cancer treatment, experienced failure after failure, and pivoted back to infectious diseases because of Covid-19, and what a good thing for all of us that they did!), I strongly suspect that the effect we saw was wholly due to the cells, not the vaccine (then again, I work at a cell therapy company). The paper, which came out concomitantly with the abstract, shows that about the same number of participants who got the vaccine progressed and responded (see Figure 2 for that).
No context
“The tumor stopped growing in all patients”, the headline says. Well, look at Figure 2 again; it’s what we call a waterfall plot, which is an aspirational name: if the bar goes up from baseline it means that the tumor grew, if it goes down it means that it shrank, so you want it to look like a waterfall. But in 8 of the 21 participants presented in the paper it grew! And in 5 more it barely came down — those count as “stable disease”, because measuring tumors is not a precise science and a pixel here or there on the digital ruler can make all the difference. In only 8 of the participants did the tumor shrink, and in only one of those did it go away completely.
This is, I’m sad to say, about what you would expect for a Phase 1 trial of a cancer drug. Most patients who make it to such a trial have slow-growing tumors, and having “stable disease” in that context — where the tumor is allowed to grow by 20% before it is called “progression” — is perfectly meaningless. Note that you will find terms like “disease control rate” or “clinical benefit rate”, which combine participants whose tumors shrank with those who had this “stable disease”. Those two metrics are also meaningless without a control group.
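To make the arithmetic concrete, here is a toy sketch using the thresholds mentioned above (roughly the RECIST convention), with entirely made-up percent changes in tumor size: shrinkage of 30% or more counts as a response, growth of more than 20% as progression, and everything in between as “stable disease”. Notice how the “disease control rate” happily absorbs tumors that grew.

```python
# Sketch, hypothetical numbers: best percent change in tumor size per
# participant on a waterfall plot, classified with the thresholds above.
best_change_pct = [-100, -45, -35, -10, -3, 2, 8, 15, 19, 25, 40, 60]

responses    = [c for c in best_change_pct if c <= -30]        # shrank >= 30%
progressions = [c for c in best_change_pct if c > 20]          # grew > 20%
stable       = [c for c in best_change_pct if -30 < c <= 20]   # everything else

n = len(best_change_pct)
print(f"Overall response rate: {len(responses)/n:.0%}")
# "Disease control rate" lumps responders together with stable disease,
# including tumors that grew by up to 20% -- hence meaningless without a control arm.
print(f"Disease control rate: {(len(responses) + len(stable))/n:.0%}")
```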
This became longer than I intended so I’ll stop here, but yes, it’s a sad state. It reminds me of dostarlimab, only much worse, since in that case there was at least clear evidence that the drug was good; the only thing missing was context. Caveat lector!