Beyond Verification — What Responsible AI Really Demands of Human Experts

For the fifth year in a row, MIT Sloan Management Review and Boston Consulting Group (BCG) have assembled an international panel of AI experts that includes academics and practitioners to help us understand how responsible artificial intelligence is being implemented across organizations worldwide. In our first post this year, we explored how organizations should think about AI’s impact on the workforce, with our experts stressing that responsible AI means looking beyond the safety of AI systems to address real-world consequences for workers and economic stability.
This time, we asked our panel to react to the following provocation: Responsible AI efforts fail if they don’t cultivate human experts who can verify AI solutions. On the surface, there is broad consensus, with a clear majority (84%) of our panelists agreeing or strongly agreeing with the statement. But a deeper dive reveals that panelists define verification far more expansively than the provocation implies. Rather than treating it as a narrow, output-by-output check, they describe verification as the work of applying human judgment across an AI system’s life cycle: interpreting context, designing tests, auditing workflows, setting thresholds, weighing when AI should not be relied on at all, and carrying the accountability that machines cannot. Understood this way, verification is not a final checkpoint but the connective tissue of responsible AI, encompassing the design, oversight, and accountability that organizations need to scale alongside the systems themselves. Below, we share panelist insights and offer our practical recommendations for organizations seeking to cultivate the human expertise their responsible AI governance efforts depend on.
Humans provide the context for verifying AI outputs. ForHumanity founder Ryan Carrier backed the consensus that responsible AI efforts must cultivate human expertise to verify AI outputs because, as he puts it, “context matters.” Similarly, TÜV AI.Lab CEO Franziska Weindauer notes, “AI solutions operate within complex real-world contexts, and human experts are essential to interpret results, detect failures, and ensure that systems function as intended.” As GovLab chief research and development officer Stefaan Verhulst explains, “Many of the most significant risks of AI are societal rather than technical, such as misalignment with public values, harmful impacts on vulnerable groups, or inappropriate deployment contexts.” Those risks, many experts contend, are precisely the ones hardest to address with a wholly technical solution.
For some, context is irreducibly human and cannot be captured in machine-readable form alone. As OdiseIA president Idoia Salazar explains, “Not everything is translated into data, such as context in a specific situation.” Yasodara Cordova, a distinguished member of the investments committee of the Co-Develop fund, agrees that responsible AI requires “contextual sensitivity” — a quality that, in her view, “cannot be automated.” Jai Ganesh, Ph.D., vice president of technology, connected services, engineering at Wipro Ltd., adds, “Situational awareness is another area of concern for AI systems where an output that is correct may be culturally insensitive or legally problematic in a specific country or situation.” Automation Anywhere’s Yan Chow similarly observes that “humans identify sociopolitical nuances and shifts that data cannot capture.” For these reasons, National University of Singapore provost Simon Chesterman concludes that “however sophisticated the model or elaborate the governance framework, someone must still be capable of asking whether a system is reliable, lawful, and appropriate in context,” a responsibility, in his view, that requires human expertise.
If context cannot be fully captured by machines, the practical consequences are significant. Carrier argues that “domain experts are necessary to provide feedback and risk assessments that result in well-tailored controls, treatments, and mitigations designed to tackle the specific and unique risks presented by context-dependent AI deployment and usage.” Salazar goes further, contending that “no matter how advanced a tool is, it cannot be the one to guarantee that its outputs are fair, safe, or appropriate to the context.” For Ganesh, the risks are heightened with “edge cases, rare scenarios, and new contexts where AI systems tend to break down,” and he believes “catching these failures requires human judgment and deep domain expertise.” Chow agrees that human expertise is critical for building “expert-validated guardrails for the edge cases where AI is most fragile.” Moreover, he argues that “responsible AI frameworks collapse into compliance theater without human experts because AI cannot perceive dynamic context.”
Losing human expertise poses an existential threat to organizations. The concern is not only that AI systems will fail without human expertise to verify outcomes but that organizations may lose human expert capacity over time. Cordova argues that “organizations that delegate verification only to AI erode the institutional capacity to audit it as expertise atrophies and junior staff never develop independence.” Likewise, consultant Linda Leopold cautions, “If we always let AI do the work for us, we gradually lose the expertise needed to oversee it,” and “we need to keep human judgment sharp enough to challenge it.” EnBW chief data officer Rainer Hoffmann says, “Responsible AI efforts fail not because humans cannot verify every AI decision but because organizations lack the expertise to govern how AI systems should be evaluated, monitored, and deployed responsibly.”
The business stakes, through this lens, are fundamentally human. As Australian National University’s Belona Sonna contends, “The core objective of responsible AI is not only to design systems that align with ethical principles but also to ensure that humans remain capable of intervening when misalignment occurs.” Put differently, Salazar says that responsible AI “needs people who are prepared not to delegate to machines what remains a fundamentally human responsibility.” Without this capacity, the question of whether responsible AI requires human verification of AI outputs becomes moot — no one would be left with the expertise to do it.
Human verification alone does not scale. Despite broad support for the importance of cultivating human expertise, many experts cite concerns about the scale and scope of human verification. Wharton School professor Kartik Hosanagar explains: “There are many settings where it’s helpful to have human verification. But there are many others where human verification is infeasible because of the scale of verification needed.” Hoffmann agrees that for “applications that process large volumes of data or detect patterns beyond human capability, output-by-output human verification is neither feasible nor meaningful.” For some experts, requiring human verification to scale in this way would undermine the entire value proposition of using AI in the first place. As IMD professor Öykü Işık puts it, “the core value of AI lies in its speed and scale,” such that “requiring human verification for every output would effectively neutralize these efficiency gains.”
The solution, for these experts, is not to abandon human judgment but to deploy it more strategically. Philip Dawson, head of AI policy at Armilla AI, believes that “as AI systems grow in complexity and deployment velocity, human-only verification becomes a structural bottleneck” and that a different approach is needed. Citing cybersecurity as an example, Işık contends that a system needs the ability to identify when human intervention is required “while relying on automated decision-making for the bulk of the workload to avoid massive operational bottlenecks” and argues that “the most successful responsible AI efforts treat human expertise and automated tools as a combined system.” Alyssa Lefaivre Škopac, director of trust and safety at Alberta Machine Intelligence Institute, advocates for a “defense-in-depth approach that spans everything from front-line users who can meaningfully question an output to the professionals building the assurance ecosystem around these systems.” Dawson similarly contends that “the field must invest in automated evaluation frameworks and agentic assurance pipelines that extend, not replace, human judgment at scale.”
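What a combined system might look like is easiest to see in miniature. The sketch below is our own illustration, not any panelist’s implementation: a simple triage rule, written in Python, that auto-accepts routine, high-confidence outputs and escalates high-stakes or low-confidence cases to a human reviewer. The categories, confidence threshold, and routing labels are all hypothetical assumptions.

```python
from dataclasses import dataclass

# Illustrative values only: in practice, domain experts set and
# periodically revisit these thresholds and category lists.
AUTO_ACCEPT_CONFIDENCE = 0.95
HIGH_STAKES_CATEGORIES = {"credit_denial", "medical_advice", "legal_advice"}

@dataclass
class ModelOutput:
    decision: str      # what the AI recommends
    category: str      # the kind of case it applies to
    confidence: float  # the model's self-reported confidence, 0.0-1.0

def route(output: ModelOutput) -> str:
    """Return 'human_review' or 'auto_accept' for a single AI output."""
    # High-stakes categories always go to a person, regardless of confidence.
    if output.category in HIGH_STAKES_CATEGORIES:
        return "human_review"
    # Low-confidence outputs are the edge cases where AI is most fragile.
    if output.confidence < AUTO_ACCEPT_CONFIDENCE:
        return "human_review"
    # The routine, high-confidence bulk flows through automatically,
    # preserving the speed and scale that make AI worth using.
    return "auto_accept"

print(route(ModelOutput("approve", "routine_refund", 0.98)))  # auto_accept
print(route(ModelOutput("deny", "credit_denial", 0.99)))      # human_review
```

The design point is where the judgment lives: The code merely enforces thresholds and category lists that human experts own and revise, which is what separates this pattern from delegating verification to the machine.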
Oversight and accountability remain paramount. In addition to relying on a combination of human and machine verification, our experts believe that oversight and accountability remain paramount to any responsible AI strategy. Chesterman argues that “verification should not be understood too narrowly.” He adds, “In some settings, human experts will directly validate outputs; in others, they will design tests, audit workflows, set thresholds for acceptable use, or decide when AI should not be relied upon at all.” In other words, as Chow puts it, “Human expertise is a design-time necessity, not just a run-time check.” Former DBS Bank chief analytics officer Sameer Gupta agrees that “governance and oversight should be embedded into every stage of an AI solution’s design and deployment rather than treated as a final checkpoint on the outputs alone.”
Many experts argue that human verification of AI outputs is essential not as an end but as a core part of meaningful oversight and accountability over AI systems. IAG chief AI scientist Ben Dias explains that as “a technological construct … AI systems lack the agency to be held legally or ethically accountable for the consequences of their actions.” For this reason, Dias says, “every AI solution needs an accountable human who is responsible for ensuring that the system’s outputs are properly understood and verified.” ADP’s chief product officer Naomi Lariviere agrees, saying, “AI systems can generate recommendations and automate decisions, but they can’t carry accountability.” Mike Linksvayer, vice president of developer policy at GitHub, argues that “as systems become more agentic, the limiting factor is no longer the ability to check individual outputs but the ability to exercise informed judgment over goals, constraints, escalation paths, and responsibility.”
Recommendations
If the limiting factor is the ability to exercise informed judgment, not just check AI outputs, then organizations need to invest in that judgment deliberately. We offer the following recommendations for organizations looking to cultivate human expertise that scales with their AI ambitions:
1. Verify designs, not just outputs. A narrow view of human verification that only addresses system outputs is insufficient. Human verification, in the broader sense of human oversight, should be embedded at every stage of an AI solution’s design and deployment, not treated as a final checkpoint. This means human experts setting thresholds, designing tests, auditing workflows, and deciding when AI should not be relied on, not just reviewing individual outputs after the fact. (For a minimal sketch of what such a design-time gate might look like in code, see the example following this list.)
2. Don’t rely on human verification alone. Because human verification of every AI output doesn’t scale, organizations committed to responsible oversight should invest in a variety of approaches that use automated tools to extend or augment human judgment. Human verification should be emphasized where human judgment is essential, including edge cases, high-stakes decisions, and novel contexts, while automated tools can handle the remaining volume of tasks. The goal is a combined system that extends human judgment at scale rather than either replacing or being bottlenecked by it.
3. Invest in human expertise. Organizations should invest in human expertise to verify the outputs of AI systems and provide ongoing oversight of how systems are designed and whether they are working as intended. In fact, as technical capabilities grow, the need for human expertise only increases. If junior staff never develop independent judgment and senior employees’ expertise atrophies because they are not part of this process, the organization risks losing its ability to govern AI systems. This may mean maintaining human involvement in processes or tasks that build expertise and judgment, even when they could be automated with AI. In these cases, the forgone efficiency gains should be viewed as a strategic investment in the organization’s future capacity to govern AI.
4. Verify what is learned, not just what is produced. Organizations tend to focus verification on whether an AI system’s outputs are correct, but they also need to scrutinize the lessons they draw from AI deployments and outcomes. When teams interpret pilot results, measure performance gains, or decide what worked and what didn’t, those conclusions become the foundation for future investments, scaling decisions, and organizational narratives about AI’s value. If those lessons are flawed (the wrong metrics were tracked, edge cases were ignored, or success was declared prematurely), organizations risk perpetuating bad assumptions at increasing scale. Human experts should be involved not only in verifying what AI systems produce but in critically evaluating what the organization believes it has learned from deploying them.
5. Treat verification as a strategic imperative, not just a responsibility practice. According to a global executive survey conducted in 2025 by MIT Sloan Management Review and BCG, 86% of top management teams consider AI to be a significant part of their strategic priorities. When AI is central to how an organization competes, grows, and makes decisions, the quality of human oversight directly affects strategic outcomes, not just ethical ones. Flawed outputs, unchecked deployments, and poorly drawn lessons don’t just create responsibility risks; they lead to misallocated resources, failed initiatives, eroded competitive position, and lost customer trust. The preceding recommendations — verifying designs, combining human and automated oversight, investing in expertise, and scrutinizing what is learned — are not merely aspirational additions to a responsible AI program. They are preconditions for effective strategic management.
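To close the loop on recommendation 1, here is a minimal sketch of what a design-time verification gate could look like. It assumes a hypothetical expert-authored test suite and human-set thresholds; the pipeline simply refuses to deploy a model that fails them. Every name, test case, and number below is an illustrative assumption, not a reference implementation.

```python
# Expert-authored test cases: (input, expected output, is_edge_case).
# Edge cases are the scenarios experts know the system tends to break on.
EXPERT_TEST_CASES = [
    ("routine query", "expected answer", False),
    ("ambiguous request", "ask for clarification", False),
    ("rare edge case", "safe fallback", True),
]

MIN_ACCURACY = 0.90          # human-set threshold, revisited on audit
MAX_EDGE_CASE_FAILURES = 0   # zero tolerance where the system is fragile

def passes_deployment_gate(model_predict) -> bool:
    """Run the expert test suite; return True only if thresholds are met."""
    correct = 0
    edge_failures = 0
    for prompt, expected, is_edge in EXPERT_TEST_CASES:
        ok = model_predict(prompt) == expected
        correct += ok
        if is_edge and not ok:
            edge_failures += 1
    accuracy = correct / len(EXPERT_TEST_CASES)
    return accuracy >= MIN_ACCURACY and edge_failures <= MAX_EDGE_CASE_FAILURES

# A stub model that handles routine work but misses the expert-flagged edge case.
stub = {"routine query": "expected answer",
        "ambiguous request": "ask for clarification"}.get
print(passes_deployment_gate(stub))  # False: the edge-case failure blocks deployment
```

The mechanics are deliberately trivial; the value is organizational. The gate turns expert judgment about tests, thresholds, and acceptable use into something the deployment process enforces every time the system changes.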