In Conversation with…John D. Birkmeyer, MD
Editor's note: John D. Birkmeyer, MD, is an internationally recognized health services researcher with expertise in performance measurement, hospital efficiency, and value-based purchasing. A practicing general surgeon, he has served in advisory roles for the Centers for Medicare & Medicaid Services and the American College of Surgeons National Surgical Quality Improvement Program, and he chairs the expert panel on evidence-based hospital referral for the Leapfrog Group. He is currently Chief Academic Officer and Executive Vice President at Dartmouth-Hitchcock. We spoke with him about his video study that found a link between practicing surgeons' directly observed technical skill and surgical safety outcomes.
Dr. Robert Wachter, Editor, AHRQ WebM&M: What gave you the idea of doing the video study?
Dr. John D. Birkmeyer: Well, we'd been interested for many years in what drives very large variations in surgical outcomes both across hospitals and across surgeons. We'd spent a long time thinking about components of perioperative care—the use of antibiotics, the use of prophylaxis against postoperative blood clots—all of the things you can easily measure that occur both before and after surgery. But to fully understand why some hospitals and some surgeons have better outcomes than others, we felt it was necessary to get at the heart of the matter: to what the surgeon actually does in the operating room.
RW: Tell us how being a surgeon influenced you. Have you seen marked variations in technique?
JB: As researchers in surgery, we've all seen wide variation in surgical outcomes as reflected by statistics from big national databases. But on the other side of the fence, as clinical surgeons, we see wide disparities not only in how surgeons do different types of procedures but in how well they do them. We see that with surgical residents, and to a lesser degree we're also aware that some practicing surgeons are much more skilled than others.
RW: How did you choose the methodology for that study?
JB: We came to the methodology for this study because we had exhausted every other avenue of inquiry in discerning what happens in the operating room. For many years working with this large collaborative of bariatric surgeons and hospitals in the state of Michigan, we studied all of the other things that could plausibly drive differences in outcomes that relate to practices in the operating room. We studied measurable aspects of safety culture and communication and teamwork in the operating room. We used both qualitative and quantitative methods trying to understand how handoffs and distractions—not so much at the hands of the operating surgeon but at the hands of the entire team—might affect outcomes. Then finally we looked at whether there were aspects of technique that could explain why some surgeons have better outcomes than others. For example, does it matter if that anastomosis is hand-sewn, or is the use of a stapler safer? After a couple of years studying those aspects of operative quality as carefully as we could, we succeeded in explaining only a small additional amount of variation in outcomes. At the end of the day, we decided to take the plunge into doing something that is much more intrusive but perhaps much more meaningful in that it's really getting at the proficiency and skill of the operating surgeons.
RW: I might have envisioned trying to have your own people going into all these surgeons' ORs to film them. But it sounded like you asked them to send you videos. What drove you to that decision, and did you have any concerns about asking them to supply their own data?
JB: For many years, we had been engaged in the practice of round-robin site visits, where I and other members of our research team would go and watch how programs do things and how surgeons operate. Certainly at a qualitative level we appreciated wide variation both in technique and skill. The problem was that we needed to do it on a larger scale, and we needed to do it in a way that allowed us to empirically assess skill. To respect both the confidentiality and comfort levels of the surgeons, we decided to have surgeons mail in their videos rather than us actively recording them.
We explicitly told surgeons to pick any case they wanted, which at the end of the day meant that most surgeons were sending us cases that they were proud of, in other words, cases that went well. Many would wonder if, from a scientific point of view, that was the right way to do it. Obviously you're not getting their average practice. You're getting their very best practice. But aside from the practical upsides of doing it that way, there are probably some scientific advantages as well in that, from a measurement point of view, we were essentially removing all of the measurement noise associated with some cases being harder than others and some cases going well and others going poorly. This allowed us to get at the cleanest, purest assessment of a surgeon's skill, or at least their best possible skill. The fact that those measures were so powerfully associated with outcomes only speaks to how important surgical skill is.
RW: As I looked at the videos, the examples of the surgeons with relatively poor surgical technique are really striking. Do you think those surgeons know that?
JB: Well, no surgeon believes that he or she is below average. None of the surgeons that agreed to participate in this study went into this study thinking that they were going to be in the bottom quartile of skill. It was only afterward, as they participated in the process of rating their peers and they ultimately saw the empirical findings, that they appreciated the amount of variation. At least for that subset of surgeons, they learned exactly where they stood against their peers.
It was really eye opening for me as the lead of the study. Although I appreciated at some clinical level that some surgeons are better than others, I was just astounded at the degree to which that was true among mature, highly specialized, generally high-volume surgeons. As the videos started to come in and as I began to evaluate them as one of the peer raters, I was completely flabbergasted by what I saw. Just by chance, some of the earliest videos that came in happened to be from surgeons who were really, really good. I have to say, as a peer rater, it was sort of depressing to be rating surgeons who are clearly technically better than I am. But then, a couple of videos later, I was sampling recordings of surgeons who were ultimately rated at the lower end of the spectrum. I not only felt better about my own very average skill, but I also appreciated that there really is something here. I also appreciated that, in some instances, there are legitimate clinical safety concerns associated with some of the videos we were seeing.
RW: So you're watching the videos come in, you're seeing variations in technique that are even a little bit greater than you might have anticipated. You had a hypothesis that this somehow would be related to outcomes. When you cracked that code and realized that they were in some ways staggeringly associated with outcomes, what was your thought about that?
JB: I have to say that most surgeons believe that the quality of the operative procedure itself is a big driver of patient outcomes. As researchers, we have it ingrained in us that it's not about the physician, it's about the entire team and the entire episode of care. Even so, I had a hypothesis that surgeon skill would matter.
However, I never imagined for a second that it would be as powerfully associated with outcomes as our findings ultimately indicated. When our statistical analyst walked into my office with the very first analysis of the relationship between the skill ratings of the 20 surgeons and summary measures of each surgeon's complication rate, I was completely floored; the data were almost too good to be true. I thought that somehow there was some tautology or other error associated with how the analysis had been done, because it looked like somebody had fudged or faked it. Then we looked at the data every which way we possibly could, and looked at alternative measures of skill versus alternative measures of quality or outcome, and we kept finding exactly the same thing. No matter how we measured skill on the front end or which outcome or aspect of quality we evaluated on the back end, the correlation was just amazingly powerful.
RW: How has this study been received in the surgical community?
JB: On the one hand, I don't think that too many surgeons were surprised by the basic finding that operative skill varies. Every surgical resident that works with multiple attending surgeons knows that, even within his or her hospital, there is wide variation in skill. So that finding was not a surprise. I also don't think any surgeon was surprised that operative skill or proficiency would be associated with outcomes.
What did take surgeons and the profession at large aback to some degree was the relative simplicity and ease with which you can actually measure the skill of a surgeon. It seems so obvious, but nobody had really done that before. I think surgeons were also a little surprised that, even when you focus on comparatively high-volume surgeons all doing the same type of procedure, and they're picking which operation to send you, there was still so much variation.
Finally, surgeons were surprised not that skill mattered but at the magnitude of the association between skill and outcomes: the bottom quartile of surgeons ranked on skill had rates of complications, reoperations, and other adverse measures that were about three times higher than those of the top quartile. All of those findings caught readers, and particularly surgeons, off guard. What probably has been most interesting for me to observe is the visceral reaction that surgeons have had to the study with regard to what it means for their own profession. I don't think that it's too difficult for surgeons to envision how these data could be interpreted and what they might mean in future years for how surgeons become surgeons in the first place, how they're certified, and even how they get hired by hospitals.
RW: So it's clear that variation in technical skills is related to these outcomes. What do you think the implications are for the selection of surgeons or proceduralists?
JB: Medical students basically get to be the types of doctors they want to be, primarily as a function of their own personal preference. Medical students choose to go into surgery because they like it, they think it's cool, or they think it's a particular match to their personality style and all of the stereotypes that are associated with the profession. Not because there is any reason for them or us to believe that they are particularly good at it. So I could imagine, as part of the evaluation process for medical students applying for surgical residencies, that there could be an aptitude test to assess a core set of visuomotor skills. I don't know exactly what that test would look like, or the science associated with making sure that the test is reliable in separating good future surgeons from not-so-good ones. But I'm sure that folks with expertise in skill assessment are probably thinking about how to operationalize that now.
RW: Are there lessons from other fields with similar attributes? I'm thinking about the military or aviation where they have done some aptitude testing before people enter the profession.
JB: In aviation and in a couple of other military-associated fields, very early rigorous aptitude testing assesses whether you have the basic skill set for success in that field. A very wide battery of simulation-based tools and other instruments also has been developed in the graduate medical education context that could be applied, not simply to evaluate surgical residents as they progress through their training, but perhaps even before they get there, to make sure that we select those with the most aptitude, or at least identify the small subset of individuals who just aren't well suited to operative surgery.
RW: You've been discussing this in a way that would imply that we all come into these professions with our genes having determined whether we can be good at this. Do you believe that's true, or do you believe that one of your lower quartile surgeons could have been trained into becoming a medium or even high technically proficient surgeon?
JB: I'm completely confident that training and coaching can make anybody better, whether it is making me better as a tennis player or making a below average practicing surgeon an average practicing surgeon. My colleagues Nancy Birkmeyer and Justin Dimick have a new study funded by the National Institutes of Health that is applying a statewide coaching intervention with a video-based platform. It is intended to basically coach every bariatric surgeon in the state of Michigan on how to do better. The hypothesis of that study is that we can make the really good surgeons just a little bit better, we can make the average surgeons a little bit better, and perhaps the worst surgeons have the most room to move. However, I don't believe, nor does anybody else, that you can completely eliminate the disparity associated with natural aptitude and skill. There is no doubt that my own tennis game has room to move, but all the coaching in the world is never going to make me Roger Federer. I believe that that ultimately will hold true for operative surgery as well. Both for scientific and for practical reasons, we'll never entirely eliminate variation in skill and in outcomes. From a public health and a hospital safety point of view, our first priority is to deal with the small subset of surgeons who are on the wrong side of the distribution and really pose legitimate safety issues for patients.
RW: As you think about the implications of your work, from the standpoint of an individual hospital's credentialing process, of the boards that assess the quality of individual doctors, or of The Joint Commission, which assesses the quality of a hospital and its safety systems, what should they take from your study?
JB: I believe that peer-based video assessment of skill will become a standard part of the board certification and of the recertification process. Not that there is anything wrong with the 200-question surgical SATs that we have to take every 10 years to maintain our certification, but something based on what we do in the operating room and the specific subsets of procedures we do would probably be much more meaningful. I suspect that process will not only become one of the cornerstones for board certification for certain types of procedurally oriented specialties, but also a main avenue by which hospitals hire surgeons in the first place.
Why not have those applicants send a video representative of the most common types of procedures they do? Once a hospital has hired a surgeon, video-based assessment also has obvious implications for how we deal with surgeons who are struggling with their outcomes. Right now the peer review process is based on chart abstraction and people looking back at whether the surgeon made certain errors in technique or perioperative judgment. For certain types of procedures, a blinded assessment of a surgeon's skill in the most common operations he or she does could not only be done prospectively but also has the potential to be much more discriminating in identifying surgeons who are safe.
RW: So you have been talking about peer assessment, and studies seem to show that it can be done and is statistically robust, but it's expensive and time consuming. In a world of Google cars and computers assessing your golf swing, I imagine there will come a day when the kind of assessment that your peers do could be done by a computer. Are we there yet, or is that on the horizon?
JB: I have no doubt that if we can make cars that drive themselves that, at least for videoscopic surgery, we can use big data approaches to assessing spatial movements of instruments that appear on a screen and ultimately link them to outcomes. I don't know whether we'll be there in 3 years or 5 years or 50 years, but I have no doubt that we'll get there.
In the meantime, there may be some intermediate strategies for assessing surgeons that are both cost efficient and can be done at scale. One of those approaches is so-called crowdsourcing. At the University of Washington, they had anonymized videos of robotic and other types of laparoscopic surgery rated first by experts in the field and then by hundreds or thousands of nonclinical video assessors from all over the world who had no formal training in surgery or even in clinical medicine.
What struck me about the data was just how good even an untrained eye can be at identifying who's skilled and who's not. In that study, the correlations between the expert ratings and the average ratings of the crowdsourced assessors were sometimes in excess of 0.9, which may not be good enough if we're talking about credentialing. But it probably is accurate enough to serve as a screening tool for identifying the large majority of surgeons who are average or above average, reserving expert ratings for the minority of surgeons for whom there are concerns.
RW: You mentioned part of what got you stimulated to do this is a bias in the field of safety and quality that it's all about the system and the team and not really about individual performance. Does your work have broader implications to the way we're thinking about quality measurement and patient safety?
JB: I think so. There are at least three important avenues down which the research enterprise will need to take this field. The first and perhaps most obvious is that we need to assess the generalizability of our findings in bariatric surgery to other surgical disciplines. We're actively pursuing those very questions across the different types of collaboratives that we have in Michigan. My hypothesis is that we will be able to closely replicate our findings in other similarly technically complex operations. I'm sure, however, that as the literature is fleshed out over the next several years, we'll find other types of procedures or disciplines for which operative skill and aspects of technique play a comparatively small role. For those, it's really the cognitive skills of picking the right patients for surgery in the first place, or the diligence of postoperative care, that drive how patients do. Identifying the subset of surgical procedures on which to target skill-focused assessment and improvement activities will be paramount.
The second major avenue falls more on the psychometric disciplines. The measurement approach that we took to our study was very clinically intuitive, and the end results show that it worked. But whether our instrument and our approach to assessing skill is in fact the optimal one is unclear, and I'm sure that people with more expertise in that area than me will come up with simpler, more parsimonious ways of assessing which surgeons are safe and which are not. Finally, I think that the research community will make strides in figuring out the best, most scalable ways to translate skill measurement into improvement work. I described what we are doing across the state of Michigan with funding from the National Institutes of Health on video-based coaching. But one-on-one coaching is not the only avenue for making surgeons better.
RW: If you were a patient and you saw those videos, you would be quite scared that you were going to end up with Surgeon B rather than Surgeon A. What should a patient do today?
JB: Well, obviously patients have very little access to high quality data about a surgeon's skill, or even to videos of surgeons doing the types of procedures that the patients need. While in an ideal world patients would have access to this type of information, the best they can rely on now are crude proxies of proficiency and skill. Where those data are available, I think that for complex high-risk surgery, patients should still consider the operative volume of the surgeon and the hospital. They should also look at the outcome measures for those procedures and specialties for which those data are publicly available. I know that the Society of Thoracic Surgeons is moving toward making the outcomes data of cardiac surgeons more publicly accessible to patients, and my hope is that over the next several years the other professional organizations follow suit.
RW: If you had access to both the videos and the outcomes, which would you choose to pay the most attention to?
JB: I personally would pay attention to both. But particularly for procedures or specialties where surgeons don't do a million of them, so the outcome measures tend to be statistically imprecise—perhaps influenced by a surgeon's case mix—I think that a measure or a direct observation of their skill and their technique is probably more valuable.