In Conversation with…David Urbach, MD, MSc
Editor's note: Dr. David Urbach is Professor of Surgery and Health Policy, Management, and Evaluation at the University of Toronto. He is a staff surgeon at the University Health Network and scientist at the Institute for Clinical Evaluative Sciences. We spoke with him about his New England Journal of Medicine study evaluating the effectiveness of checklists in Ontario, Canada—a study whose results shocked the safety world.
This interview can be heard by subscribing to the AHRQ WebM&M Podcast.
Dr. Robert Wachter, Editor, AHRQ WebM&M: Tell us a little bit about your motivation for the New England Journal of Medicine study. What gave you or your colleagues the idea of doing it?
Dr. David Urbach: Surgical safety checklists and their use exploded in 2009. Previously, there had been some discussion about communication in operating rooms, teams, and general research on checklists that went back about 5 years. Literature emerged that there were many failures of communication and failures of proper team coordination and functioning in operating rooms. Then a very high profile study was done primarily with the World Health Organization in eight sites around the world—it was a before-after study of implementation of surgical safety checklists—it showed a shockingly high effect of surgical safety checklists. Suddenly there appeared to be very strong evidence that going through a checklist before a surgical procedure could create very large improvements in surgical outcomes—larger than just about any other intervention used around the time of surgery.
To put this into perspective, the study (the senior author was Atul Gawande) showed that there was a 50% reduction in risk of death after surgery. That means one out of every two postoperative deaths could be prevented by use of a surgical safety checklist. That is better than any intervention we know that's effective around the time of surgery. We have pretty good evidence that antibiotics are effective in reducing the risk of wound infection, but there's no evidence that proper use of antibiotics can actually prevent death and reduce operative mortality. Yet a checklist that confirms the use of these various maneuvers by itself seemed to show a very large effect, even larger than the various components themselves.
It was very high profile, and it captured a lot of people's imaginations. Because of that it was widely adopted very quickly by a lot of organizations worldwide that are responsible for surgical safety. The National Health Service in the United Kingdom and the health authorities in the Province of Ontario mandated either the use of surgical safety checklists, or at least reporting the extent of the use of surgical safety checklists. That was implemented almost overnight. For an article published in 2009, by 2010–2011 this became a required operating procedure. That's a very fast trajectory—for something that just appeared in the literature as an effective intervention to becoming a mandatory procedure.
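To make concrete what a 50% relative reduction in operative mortality would imply, here is a minimal Python sketch of the arithmetic. The 2% baseline mortality is a hypothetical figure chosen only for illustration; it is not taken from the study discussed here, and the function name is likewise an assumption.

```python
def risk_reduction(baseline_risk, relative_reduction):
    """Translate a relative risk reduction into absolute terms.

    Returns (risk after the intervention, absolute risk reduction,
    number of patients treated per death prevented).
    """
    risk_after = baseline_risk * (1 - relative_reduction)
    absolute_reduction = baseline_risk - risk_after
    number_needed_to_treat = 1 / absolute_reduction
    return risk_after, absolute_reduction, number_needed_to_treat


# Hypothetical baseline: 2% operative mortality (an assumption, not study data)
baseline = 0.02
after, arr, nnt = risk_reduction(baseline, relative_reduction=0.50)

print(f"Mortality before: {baseline:.2%}, after: {after:.2%}")
print(f"Absolute reduction: {arr:.2%}; about {nnt:.0f} operations per death prevented")
```

Under these assumed numbers, halving the risk means one death prevented for roughly every hundred operations, which is the scale of effect the interview describes as larger than that of any other perioperative intervention.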
RW: As a surgeon and a researcher, was your take on the Gawande study that it was not credible or that there were study design flaws that accounted for that shockingly high effect?
DU: Well, it's an interesting question. There's a dichotomy here. If you presented that study to any non-specialist in the world they would say, of course, they're not shocked one bit. Part of it is driven by the idea that if you don't use a checklist then things are disorganized, and that it was the lack of a checklist that created opportunities for hazards. But a lot of structures and standard operating procedures were in place in hospitals in North America even before 2009. Many hospitals were routinely monitoring patients' oxygen saturation during surgery. They were giving antibiotics; they were using anticoagulants to prevent blood clots around the time of surgery. These perioperative processes were probably being done very frequently. So the concept that—but for a checklist—things were like the Wild West in operating rooms, I don't think is an accurate depiction. Things could always be improved, but they weren't all that bad in North America.
When I would speak to people about safety checklists, they weren't surprised at all that they appeared so effective. Maybe people don't quite think about what the magnitude of the clinical effect implies, like what a 50% reduction in mortality actually is. But no one questioned this—certainly no one responsible for managing perioperative safety in hospitals, health ministries, or other health authorities. They unquestioningly accepted that these results must be authentic and legitimate.
However, the typical surgeon had difficulty understanding how this could be true, because when you actually look at what's involved in these checklists, the results don't necessarily make sense. A surgeon could believe that it is a good tool; it might improve communication. It definitely will improve work satisfaction among people who work in operating room environments. But most studies didn't look at those types of results. The studies focused on clinical outcomes, and a practicing surgeon would probably wonder how exactly is this translating into such large improvements in perioperative care?
RW: That gets to the issue of face validity. When you read it, you see life through two lenses: one as a practicing surgeon, the other as a methodology expert. What were your concerns about what could have happened methodologically that might have explained a difficult-to-understand result?
DU: Although before-and-after studies sound very simple, in practice a few potential pitfalls can create large biases unless you're looking out for them. The biggest is selection bias. If you do a before-and-after study on use of a surgical safety checklist, in the typical design, people do these in hospitals and collect data from patients and from actual episodes of surgical care. Say you take a period of 3 months, look at all the patients, and see what their outcomes are. And then 3 months later introduce a checklist, look at the outcomes, and compare what happened after to what happened before. But the most important thing is to make sure that patients you're looking at before are very similar to the patients you're looking at after. In surgical care, often it is a very small group (at high risk of complications, high risk of death) that explains a lot of the negative outcomes. If your study design influences your ability to include those subjects in the study, you might end up with flawed results.
I'm a bit suspicious about that happening in some of these studies. Subsequent research shows that in hospitals that have made a decision to adopt safety checklists, the patients on whom checklists are actually done have excellent outcomes, very low operative mortality. Patients who have a partially completed checklist have a better outcome, but not as good as the patients who have 100% of all items completed. The worst group is always those patients who don't have any elements of a checklist done. Typically they have very high operative mortality in the range of 10%, much higher than any typical operation. We think that must indicate some sort of selection bias. For patients in a hospital where checklists are done, if for whatever reason they don't have the checklists done, there must be something odd that makes them different from other patients. Maybe it's because they arrived in shock, or they might have had a ruptured abdominal aortic aneurysm and were rushed to an operation without a checklist being done, or something like that.
RW: It's just a marker for how sick they are.
DU: Now, some studies have demonstrated an effect, and there may be a hazard in how patients are assigned to study groups in before-and-after studies. It's really important to make sure that the groups of patients being compared are identical.
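The selection-bias concern described above can be sketched numerically. The Python example below assumes a checklist with no true effect at all, and shows how mortality can still appear to fall if the "after" cohort captures fewer of the small group of very high-risk patients. Every number (the share of high-risk patients, the mortality in each group, the fraction captured) is hypothetical and chosen only for illustration.

```python
def observed_mortality(share_high_risk, risk_low, risk_high, high_risk_captured=1.0):
    """Mortality among the patients who actually end up in a study cohort.

    share_high_risk: fraction of all surgical patients who are high risk
    risk_low, risk_high: operative mortality within each group
    high_risk_captured: fraction of high-risk patients the cohort includes
    """
    low = 1.0 - share_high_risk
    high = share_high_risk * high_risk_captured
    deaths = low * risk_low + high * risk_high
    return deaths / (low + high)


# Hypothetical population; the checklist is assumed to have NO true effect.
SHARE_HIGH, RISK_LOW, RISK_HIGH = 0.05, 0.005, 0.20

# "Before" period: every patient is counted.
before = observed_mortality(SHARE_HIGH, RISK_LOW, RISK_HIGH)

# "After" period: only 40% of high-risk patients (for example, emergencies
# rushed to surgery without a checklist) make it into the checklist cohort.
after = observed_mortality(SHARE_HIGH, RISK_LOW, RISK_HIGH, high_risk_captured=0.4)

apparent_reduction = 1 - after / before
print(f"Before: {before:.2%}  After: {after:.2%}")
print(f"Apparent relative reduction with zero true effect: {apparent_reduction:.0%}")
```

The per-patient risks are identical in both periods; the apparent improvement comes entirely from who is counted, which is why the comparability of the two groups matters so much in this design.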
RW: How did you conduct your study?
DU: We tried to replicate what the World Health Organization study and a couple of other subsequent studies did. We also used a before-after design. We used relatively similar time periods, but we relied on administrative data. We did not collect any information on patients or their outcomes from hospital records themselves. We relied on electronic sources of health information that are held at the research institute. We contacted hospitals in the Province of Ontario to get a date when checklists were introduced. They also sent us a faxed copy of the checklist that they used. Most were very similar to the World Health Organization design or to another called the Canadian Patient Safety Institute (CPSI) checklist; both are quite similar in their structure. They basically have a three-part design. There's a briefing before the patient goes to sleep, another timeout just before the incision is made, and finally a debriefing after the operation is completed. We got this information directly from the hospitals; then we relied on population-based administrative health data to identify all the people who had surgical procedures before and all the people who had surgical procedures after. There are advantages and disadvantages to our study. You could argue that an obvious disadvantage is that the quality of information is not as good because we're not recording information about patients from hospital charts. On the other hand, our study was much less susceptible to the types of selection biases that could be a real problem if you're relying on identifying these patients as they come in to have surgery.
RW: And what did you find?
DU: What we didn't find was a statistically significant improvement in the time period after hospitals adopted checklists, as compared to the time period before. There was a very slight apparent improvement, but it didn't meet the bar of statistical significance. We know that the outcomes of health care tend to improve over time, and we see this especially when we examine a large number of people. If you look at the outcomes of surgery in 2012 and compare them to 2011, a very small improvement happens annually. We call these secular trends, or secular changes. The small change we observed may or may not have been similar in size to what you might see just from general improvements that occur over time. Our findings are inconsistent with benefits as large as those suggested by earlier studies. They found about a 50% reduction in the risk of operative death and a substantial reduction in the risk of complications. We did not find a statistically significant improvement in death and complications, or in other health service use measures like admission to hospital, emergency department visits, etc.
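As a rough illustration of what "didn't meet the bar of statistical significance" means for a before-after comparison of death rates, the sketch below runs a standard two-proportion z-test in Python. This is not the analysis used in the study, and the counts are hypothetical round numbers chosen only to contrast a slight change with a 50% reduction.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z statistic, two-sided p-value) using the pooled standard error.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: deaths per 100,000 operations in each period
z_small, p_small = two_proportion_z_test(700, 100_000, 660, 100_000)
z_large, p_large = two_proportion_z_test(700, 100_000, 350, 100_000)

print(f"Slight improvement: z = {z_small:.2f}, p = {p_small:.3f}")
print(f"50% reduction:      z = {z_large:.2f}, p = {p_large:.2e}")
```

With samples this large, a true halving of mortality would be unmistakable, whereas a slight change of the kind that secular trends alone can produce does not clear the conventional 0.05 threshold.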
RW: There are multiple ways of interpreting that, and one would be there's something about the implementation in a large-scale government created program—where it's essentially airlifted in by a central authority—that doesn't work, and the process needs more hand holding and more local massaging. The second is that the methodological issues you've discussed in the Gawande study produced a result that was, in some sense, wrong. Which of those do you lean toward, or is there a little bit of both?
DU: It's a good question. I don't know the answer, and it may be a bit of both. We'd like to see additional studies try to replicate the clinical effects of the introduction of checklists. It is important to recognize that everyone acknowledges certain benefits of safety checklists. These have to do with team functioning, team dynamics, quality of communication, and work satisfaction for people on perioperative teams. None of us argue against any of those things, but we've been focused on clinical outcomes because that was the impetus for widely disseminating safety checklists. Is it that the other studies were flawed, or is it that this mandated use of checklists just doesn't work the same way as when motivated locally by teams with an intrinsic desire to improve quality? Both of those are possible. We're waiting for additional studies to see if the earlier studies can be replicated or if the effects don't seem as large.
What was shocking to us as well was that even though there were relatively few studies published on the effectiveness of surgical safety checklists, there were no negative studies. Just by chance alone, some studies should have shown a different result. The literature needs to mature. If a safety intervention is going to work, it has to be practical. Not every single hospital is going to have an extremely motivated and focused team that is going to develop their own local quality improvement initiative in a way that it becomes uniform across all hospitals in a jurisdiction. If something is going to work and people believe it will, then it probably makes sense to mandate its use. If you say that something really does work but only if it's not centrally mandated, then you have to wonder about how effective that intervention really is. Maybe something else caused the improvement, but not the use of the checklist itself. Maybe you were just measuring what happens with highly motivated, energetic, and well-functioning teams.
RW: As you've now thought this through and seen it at work, can you envision a process that might have worked better and might have led to different results?
DU: We actually looked into that in our study because we had the results by hospital, and we could look at every single hospital in the Province. We tried to identify hospitals that had more of a focus on quality improvement and better motivation around change. We looked at whether they were early adopters of checklists. For example, my hospital was one of the eight hospitals involved in the WHO study. Some hospitals in the Province had these organic, bottom-up teams focused on quality improvement. When we tried to replicate that, not a single hospital in the Province had a statistically significant reduction in operative mortality. That includes our hospital and the other early adopter hospitals.
RW: What are the main lessons from your work? One is at the level of the specific issue of the checklist and the second is more general lessons about research and patient safety—about dissemination, mandates—in many ways this is a metaphor for issues that come up all the time as you try to improve safety.
DU: There's a bit of a double standard in the medical literature when it comes to interventions labeled as "patient safety" or aimed at reducing medical error. This notion was popularized by Kaveh Shojania and colleagues about 10 years ago. They highlighted this double standard that we would not be satisfied that, for example, a drug worked unless we had a controlled clinical trial: a good experimental study that endeavored to reduce selection biases and measurement biases as much as possible, so that we could be confident that the results of the study were correct. That standard exists not just for drugs but also for other interventions in health care; we expect there to be randomized trials.
There never has been the same philosophy to have a rigorous standard of evidence for something that looks like it improves patient safety. There are difficulties in studying these types of maneuvers because they're context dependent. They typically can only be implemented in an all-or-none fashion. It's hard to randomize people to getting a surgical safety checklist. In an experimental study, you would probably have to randomize the hospital. It's not as easy to get a large number of subjects and independent pieces of information with which to then analyze the effectiveness of the intervention. The overarching problem to me is that we have this whole area of health care research for which designs like the before-after study constitute acceptable evidence. We would never accept a before-after study to say that an antihypertensive pill is effective, but we routinely accept it for these other interventions for reasons that are not always clear.
RW: I actually wrote some of those studies with Kaveh in the early days. So let me push back on the argument. First, it's not the same because it's much harder to research than a new antibiotic. Second, these are relatively low cost, relatively low-risk interventions and insisting on the standard for a new drug or a device would have us wait too long. Aviation didn't wait for a double-blind trial before barricading the cockpit doors or doing most of what they do in the cockpit to try to improve safety. The argument is safety is different both because it's more difficult to study and because you're talking about some interventions that are relatively low risk and relatively easy to implement.
DU: For surgical safety checklists, a lot of the discussion went along the lines of: well, what's the possible downside? This couldn't possibly be harmful; it might help, so let's do it. I don't completely discount that argument, by the way, but there are a few problems with it. One is that there is no limit to things that are not harmful and may be helpful. At some point you have to have a standard. You could implement checklists not just for surgical procedures but for just about anything you do in a hospital, from diagnostic imaging tests to assessment of patients in outpatient clinics. There's probably an infinite number of things that you could do that at least someone believes are a good idea. Each seems innocuous, may help, and has a good conceptual rationale. But you have to set priorities, and requiring decent evidence is probably as good a way as any. I don't think we're being overwhelmed by layers and layers of safety interventions. But I don't think that completely excuses us from developing reliable evidence.
RW: I appreciate that, and I was taking the devil's advocate side. I've argued your point of view more often than the opposite. If you were The Joint Commission, if you were the Province of Ontario, what would you be doing with checklists today?
DU: There is merit to use of checklists, and I would encourage their use. One problem encountered early on is that it didn't have a lot of clinical credibility with surgeons. It had a lot of credibility with nurses and administrators, but there was resistance early on. This occurred all around the world and not just in Ontario. A lot of it was just disbelief that the effects could be true.
The bodies that mandated the use of checklists didn't do it because they felt it integrated nurses and teams better, or that it reassured patients that they were being better cared for, or that it improved communication. It was adopted because they thought it would reduce the risk of adverse events after surgery. They cited the high profile articles; they quoted the magnitude of these effects. In retrospect, that might have taken away from the credibility with frontline users. If you argued that it's very important to engage teams and make them more functional, improve the quality of communication, and reassure patients that people are aware of their individual problems and that the hospital is focused on their care in a patient-based manner, there could have been more buy-in and less skepticism. I don't dispute that it's a good idea to adopt tools like this. There has to be agreement among the frontline workers; otherwise mandatory adoption doesn't result in the same effectiveness, because there isn't that degree of enthusiasm, buy-in, and energy, and team functioning doesn't improve in the expected way. But it's important to be realistic about what can be achieved.
RW: Do you use checklists in your OR?
DU: Yes. I think they're very useful. But I find it very difficult to believe that in North American hospitals they result in these improvements in health care outcomes. I tend to believe the results of my study rather than some of the earlier studies. But I think surgical checklists are incredibly useful at improving the dynamics of a very large and complex team in the operating room. They really engage the perioperative care team. Suddenly all of the nurses, anesthesia assistants, respiratory therapists, everyone's in the room; everyone is focused on the patient. They understand more of the patient's story, their background, diagnostic tests they've had, what brings them to the operating room. They're great additions to what we do, but I am skeptical that they result in these strikingly large improvements in clinical outcomes that others have reported.
RW: My next question is wildly unfair. Suppose a friend or family member of yours calls you and says they are going in for surgery, and the surgeon has told them that they're not using checklists because they read your study. What would you say to them?
DU: That is a bit of a loaded question. We were careful in our study not to suggest that checklists should be abandoned. We were just trying to get a better estimate of what their clinical effectiveness is. I would still say the most important thing in choosing a surgeon is not whether they use a checklist; it's whether what they've told you makes sense, whether they answer your questions, whether they avoid making you feel rushed, and whether you feel comfortable in their care. If you ask them if they use a checklist in this environment in Ontario where it's mandated by our provincial government, I think it's a bit of a warning sign if someone says they're not going to use it. That is indicative of a major problem with their professionalism. But even if checklists weren't mandatory, I still think it's more important that your encounters with the surgeon have led you to have confidence in them, that their proposed treatment makes sense to you, and that if you ask for a second opinion, they encourage rather than discourage it. These are the factors that I tell people are important when choosing a surgeon.
RW: Any sense of where this issue is going to go in the future?
DU: I think there will be more studies. I'm not sure they'll be better. What we'd like to see is a real sampling of what happens. Checklists have now been introduced in thousands of hospitals. In addition to our study, you have results from maybe up to 20 hospitals around the world. So if this area of medicine is like any other area of medicine, there probably is some degree of publication bias, making a journal less likely to publish an article saying, we went to all this trouble and we found that it didn't help. So it would be nice to see the literature mature, so that we can see more studies of checklists introduced in different contexts and get a sense of where ultimately the effectiveness indicator settles.