Operationalizing Responsible Artificial Intelligence (AI) in Clinical Research

April 17, 2026

Executive Summary

Artificial intelligence (AI) is rapidly becoming embedded across the clinical development lifecycle, offering significant opportunities to reduce costs, accelerate timelines, improve operational efficiency, and enhance the quality of clinical evidence.1 AI-enabled tools are already supporting protocol design, feasibility assessment, site selection, enrollment forecasting, data monitoring, and reporting. At the same time, more advanced approaches, such as virtual evidence generation, in silico control arms, and digital-twin models, are expanding how evidence can be generated, analyzed, and contextualized. These developments are fueled by increased access to real-world data, advances in analytics, and scalable cloud infrastructure, and they are reshaping how clinical trials are designed and executed.

Despite this momentum, governance approaches have not kept pace with how AI is actually being used in clinical research. Many organizations continue to rely on uniform or tool-agnostic controls that fail to distinguish between low-risk decision support and highly automated systems that operate at scale. Treating all AI in clinical research as equally risky is both inefficient and unsafe. Effective governance must scale with how autonomously AI systems operate and how directly they affect participants and clinical evidence.

To address this gap, this white paper introduces a practical, risk-based framework for governing AI across clinical development (Phases I–IV). The framework evaluates AI applications along two dimensions: level of autonomy, reflecting how independently a system initiates or executes actions, and degree of patient impact, reflecting the extent to which AI outputs influence participant safety, access to trials, monitoring strategies, and the validity and interpretation of clinical evidence. Together, these dimensions provide a structured way to assess real-world risk based on consequence, not technical complexity alone.

As AI becomes embedded in core clinical workflows and evidence-generation processes, the cost of getting governance wrong increases sharply. Overly restrictive controls can slow innovation and delay meaningful gains in trial efficiency, while insufficient oversight can compromise participant safety, equity, and regulatory confidence. This framework enables proportional governance aligned with potential consequence, allowing low-risk applications to be managed through routine oversight while higher-impact uses receive escalating levels of review and validation. In doing so, the framework helps organizations strike the right balance, enabling progress where risk is manageable and discipline where stakes are high.

Introduction

The Accelerating Adoption of AI in Clinical Development

Artificial intelligence (AI) is rapidly gaining traction across clinical development as organizations seek to reduce costs, streamline operations, accelerate decision-making, and improve the quality of clinical evidence.

AI-enabled systems such as protocol benchmarking and complexity analysis tools now support faster protocol development, while machine-learning–based feasibility and site selection platforms improve study planning and execution. Predictive enrollment models and automated reporting tools further accelerate trial operations by enabling more accurate forecasting and continuous monitoring.

In parallel, advances in virtual evidence generation, in silico control arms, and digital-twin models are expanding the scientific possibilities of clinical research. These approaches enable more precise modeling of patient trajectories, simulation of treatment effects prior to enrollment, and generation of complementary evidence alongside traditional trial data.

This momentum is driven by increased access to real-world data, including electronic health records and omics data, as well as scalable cloud infrastructure that has lowered barriers to adoption.

As AI becomes more deeply embedded in clinical research, the need for a clear, responsible, and patient-centered framework to guide its safe and reliable integration has become increasingly urgent.

Emerging Risks and Evidence of Algorithmic Bias in Clinical Trials

The accelerating adoption of AI has also surfaced a parallel set of risks that demand careful attention, particularly as these tools begin to influence decisions that shape clinical research. Numerous studies have shown that AI models, whether diagnostic algorithms, risk-prediction tools, or large language models (LLMs), often perform unevenly across populations and geographies, with some systems underestimating risk in underserved communities, misclassifying clinical phenotypes, or delivering less accurate results in certain geographic regions.

Such biases can have profound consequences in the context of clinical trials: for example, recruitment algorithms may inadvertently exclude eligible participants, risk models may fail to flag individuals needing closer monitoring, and site-selection tools may reinforce existing structural inequities. These patterns threaten participant safety, study representativeness, and regulatory confidence, while also undermining public trust in both AI and the clinical research enterprise.

Regulatory Momentum and Policy Expectations

Regulatory bodies around the world are also rapidly advancing their expectations for how AI should be developed, validated, and monitored within clinical research.8 For example, the FDA continues to expand guidance on AI and machine learning tools, emphasizing the need for transparency, meaningful human oversight, and proactive management of model drift. The EMA is similarly exploring how AI may influence medicinal product evaluation and safety surveillance, while the ICH M11 protocol template creates new opportunities, and new responsibilities, for AI-driven authoring and structured study design.9,10 The EU AI Act introduces a formal classification of high-risk AI systems, along with stringent documentation and performance requirements that directly affect clinical trial applications.11 Across agencies, a clear trend is emerging: regulators increasingly expect organizations to implement robust process-level governance, ongoing monitoring, and validation approaches that reflect the specific context in which each AI system is deployed.

Industry Readiness and Operational Challenges

The clinical research ecosystem is still developing the capabilities needed to meaningfully and responsibly integrate AI into routine practice. Many organizations face persistent AI literacy gaps across clinical, medical, and operational teams, making it difficult to evaluate vendor claims, let alone identify appropriate use cases.12 Governance structures also remain immature: accountability is often unclear, vendor assessments are inconsistent, and documentation standards vary widely. In its own work across pharmaceutical sponsors, clinics, and sites, the Council for Responsible Use of AI in Clinical Trials has found that even when early pilots show promise, organizations frequently struggle to scale solutions into production. These challenges are compounded by limitations in data quality, interoperability, privacy protections, and cybersecurity readiness, as well as barriers such as trust gaps, fear of automation, and uncertainty about evolving roles and responsibilities.13 Together, these factors underscore the urgent need for practical, implementable frameworks that can support responsible AI adoption across diverse clinical research settings, including qualification, oversight, and ongoing monitoring.

Why Action Is Needed Now

Taken together, the rapid pace of AI adoption, emerging regulatory expectations, and persistent readiness gaps illustrate a widening disconnect between how quickly AI tools are being developed and deployed, and how prepared organizations are to govern them responsibly. Regulators, sponsors, and patients are increasingly calling for transparency, fairness, and oversight, yet many current practices remain ad hoc or inconsistent. Without clear safeguards, there is a real risk that AI could amplify existing health disparities or compromise the integrity of clinical research.14 At the same time, sponsors have a unique opportunity to capture early, low-risk, high-ROI use cases, while building the long-term capacity and governance structures needed for more complex, higher-impact applications. This combination of urgency and opportunity makes now the critical moment to establish a principled, practical framework for responsible AI in clinical development.

Framework Overview

How to Apply the Framework in Practice

This framework is intended to support concrete decisions during AI evaluation, approval, and oversight. In practice, organizations can apply it at three recurring decision points:

  • Use-case intake: when a new AI tool or capability is proposed
  • Study design and planning: when AI is incorporated into a protocol or operational plan
  • Ongoing oversight: when monitoring performance, drift, or emerging risks over time

At each point, teams should:

  • Classify the AI system’s autonomy and patient impact in its specific context of use
  • Assign the use case to a quadrant
  • Apply the corresponding governance, qualification, and monitoring expectations

Classification should be revisited whenever the system’s autonomy, data inputs, or role in decision-making changes.
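
To make the classification step concrete, the sketch below encodes the two dimensions and the quadrant-level governance expectations as a simple lookup. It is a minimal illustration in Python; the enum names, the GOVERNANCE table, and the classify helper are hypothetical conveniences for this paper, not part of any existing tool.

    from enum import Enum

    class Autonomy(Enum):
        LOW = "low"    # system supports or informs human decisions
        HIGH = "high"  # system initiates or executes actions with limited real-time review

    class PatientImpact(Enum):
        LOW = "low"    # errors are limited, detectable, and readily reversible
        HIGH = "high"  # errors may affect safety, trial access, or evidence validity

    # Governance expectations per quadrant, paraphrased from the framework text.
    GOVERNANCE = {
        (Autonomy.LOW, PatientImpact.LOW): "streamlined qualification and routine oversight",
        (Autonomy.HIGH, PatientImpact.LOW): "formal qualification, defined performance thresholds, periodic monitoring",
        (Autonomy.LOW, PatientImpact.HIGH): "cross-functional review, contextual validation, heightened oversight",
        (Autonomy.HIGH, PatientImpact.HIGH): "study-specific qualification, independent validation, continuous monitoring, human-in-the-loop controls",
    }

    def classify(autonomy: Autonomy, impact: PatientImpact) -> str:
        """Return the governance expectation for an AI use case in its context of use."""
        return GOVERNANCE[(autonomy, impact)]

    # Example: a protocol-benchmarking tool that only informs human planners.
    print(classify(Autonomy.LOW, PatientImpact.LOW))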

Purpose, Scope, and Intended Audience

This framework is designed to provide practical guidance for the responsible use of AI within clinical development. Its scope is intentionally limited to Phases I–IV, focusing on the design, conduct, analysis, and reporting of clinical trials. It does not address preclinical discovery or post-market surveillance except in cases where AI tools directly intersect with clinical trial operations. The goal is to offer actionable, decision-support guidance that can be applied across the wide range of AI use cases emerging in clinical research, i.e., how AI should be evaluated, implemented, and governed throughout the clinical development lifecycle.

The framework is intended for the broad community of stakeholders involved in planning, executing, and overseeing clinical research. This includes clinical operations and trial management teams who must evaluate the operational feasibility of AI tools; data science, biostatistics, and informatics groups responsible for model development and validation; quality assurance and compliance teams charged with ensuring audit readiness; regulatory and medical governance leaders who interpret evolving policy expectations; and clinical development executives who set strategic direction. It is also relevant to technology vendors and solution providers seeking to align their offerings with responsible AI practices. Importantly, the framework is designed to support organizations at any stage of AI maturity, from those beginning to explore use cases to those integrating advanced, high-autonomy systems into routine workflows.

Risk Categorization Framework

The proposed framework categorizes AI applications based on two dimensions: level of autonomy and patient impact (Figure 1). The AI Council selected these dimensions because level of autonomy (human-in-the-loop) and impact on patients are the two most critical determinants of risk, accountability, and real-world value in healthcare AI. Level of autonomy directly governs clinical safety, regulatory exposure, and ethical responsibility, while impact on patients determines clinical relevance, harm potential, and outcome significance. These dimensions were prioritized over others because they cut across all AI use cases regardless of technology, and provide a clear, decision-grade framework for distinguishing between low-risk operational automation and high-stakes clinical intelligence that requires rigorous oversight, validation, and governance.

Level of Autonomy refers to the extent to which an AI system independently executes or adapts actions over time in pursuit of predefined objectives, with varying degrees of human involvement. Autonomy captures where and how decisions are made, rather than whether humans remain ultimately accountable for outcomes. Low-autonomy systems primarily support or inform human decision-making, while high-autonomy systems initiate actions or recommendations with limited real-time human intervention. As autonomy increases, errors can propagate more rapidly, bias may persist longer before detection, and opportunities for contextual clinical judgment become more constrained. Even when outputs are reviewed after the fact, highly autonomous systems may have already shaped workflows, eligibility pathways, or operational decisions. Importantly, autonomy also alters the risk profile even when patient impact appears similar. For example, a human-reviewed eligibility recommendation and an automated eligibility exclusion rule both influence who enters a trial, but the latter operates at greater scale, may run continuously, and can systematically exclude populations before meaningful review occurs. Autonomy therefore affects not only what decisions are influenced, but the speed, scale, and reversibility of those effects, making it a critical dimension for calibrating governance, oversight, and monitoring. Looking ahead, this dimension becomes increasingly important as AI systems evolve toward more adaptive and agent-like behaviors, reinforcing the need for governance models that can scale with increasing autonomy over time.

Figure 1. Risk-Based Framework for Governing AI Use in Clinical Development

Patient Impact captures the degree to which an AI system’s outputs may influence participant safety or shape who is able to participate in a trial, how participants are monitored and experience the study, and how trial results are interpreted to inform future patient care, regardless of whether final decisions remain human-led. The concept of patient impact is aligned with established human subjects research principles, reflecting the potential magnitude and nature of harm or burden to participants arising from study design, conduct, or analysis, whether those effects are direct or indirect.15 Importantly, patient impact is not binary; it exists along a spectrum from low (i.e., errors or bias are unlikely to affect participant safety and can be readily detected and corrected through routine oversight) to high (i.e., errors may propagate across participants or studies, are difficult to reverse once implemented, or have the potential to materially influence participant safety and the validity of the evidence generated), depending on the severity, scale, and reversibility of potential consequences. Recognizing this spectrum is essential for distinguishing between AI uses that warrant streamlined governance and those that require heightened review, safeguards, and ongoing monitoring.

As illustrated in the 2×2 matrix in Figure 1, the framework organizes AI use cases across clinical development into four quadrants based on level of autonomy and degree of patient impact, with increasing autonomy driving greater speed, scale, and difficulty of human intervention, and increasing patient impact reflecting greater severity, scope, and irreversibility of potential harm.

  • Low autonomy, low patient impact: AI applications primarily support or inform human decisions, with limited and reversible effects, and therefore require streamlined qualification and routine oversight.
  • High autonomy, low patient impact: highly automated systems act at scale or continuously but have limited direct influence on participant safety or evidence generation, warranting formal qualification with defined performance thresholds and periodic monitoring.
  • Low autonomy, high patient impact: AI tools inform human-led decisions shaping trial design, participant access, or result interpretation, where effects are indirect but can materially influence safety or scientific validity, necessitating cross-functional review, contextual validation, and heightened oversight.
  • High autonomy, high patient impact: AI systems independently influence eligibility, safety oversight, or evidence generation, where effects may be systemic, difficult to reverse, and highly sensitive to bias or drift, and therefore require study-specific qualification, independent validation, continuous monitoring, and robust human-in-the-loop controls.

Practical Recommendations by Quadrant

Below are practical, actionable recommendations for each quadrant of the framework, aligned with the level of autonomy and degree of patient impact. By mapping AI use cases to these quadrants, the framework enables organizations to calibrate governance, oversight, and monitoring proportionate to risk, ensuring responsible deployment aligned with ethical, scientific, and regulatory expectations.

Low Autonomy, Low Patient Impact

AI use cases in the low autonomy, low patient impact quadrant primarily support or inform human decision-making and have limited, indirect, and readily reversible effects on participants or the validity of clinical evidence. As such, these applications can be governed through routine oversight, defined as the same level of governance and monitoring that organizations already apply to standard, low-risk clinical systems and decision-support tools, without introducing AI-specific escalation or bespoke controls. In practice, routine oversight begins with standard qualification at onboarding, including a basic assessment of vendor or tool fit-for-purpose, data sources, and intended use; confirmation that outputs are understandable and traceable; documentation of context of use and known limitations; and alignment with existing SOPs and quality processes. These activities are typically managed through established IT, clinical operations, or vendor management pathways rather than specialized AI review committees. Human review and accountability remain intact, with AI outputs serving an advisory role and users retaining full authority to accept, modify, or override recommendations using existing governance mechanisms.

Ongoing oversight should consist of periodic, lightweight performance checks (e.g., scheduled reviews or spot checks against historical benchmarks) to confirm outputs remain reasonable and aligned with expectations. Updates or model refreshes are handled through standard change-control processes, with reassessment triggered only when material changes or issues exceed predefined thresholds. Finally, basic training and documentation ensure users understand how to interpret outputs appropriately, including key assumptions and limitations, without requiring specialized AI expertise. Together, these measures provide sufficient assurance for low-risk applications while avoiding unnecessary complexity or governance burden.
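
As a rough illustration of what a lightweight, scheduled performance check might look like, the Python sketch below compares recent tool outputs against a historical benchmark and flags escalation when drift exceeds a tolerance band. The function name, the 15% tolerance, and the sample values are illustrative assumptions, not prescribed thresholds.

    import statistics

    def spot_check(recent_outputs: list[float],
                   historical_benchmark: list[float],
                   tolerance: float = 0.15) -> bool:
        """Return True if recent outputs stay within a tolerance band around
        the historical benchmark mean; False means escalate per SOP."""
        recent_mean = statistics.mean(recent_outputs)
        benchmark_mean = statistics.mean(historical_benchmark)
        drift = abs(recent_mean - benchmark_mean) / abs(benchmark_mean)
        return drift <= tolerance

    # Example: quarterly check of an enrollment-forecasting tool's estimates.
    ok = spot_check([41.0, 39.5, 43.2], [40.0, 42.0, 38.5])
    print("within expectations" if ok else "escalate for reassessment")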

Example:

A practical low-autonomy, low patient impact application is the use of AI to support early study planning through protocol benchmarking and complexity analysis. These tools analyze large libraries of historical protocols to identify recurring design patterns and operational features that have been associated with downstream execution challenges, such as frequent amendments, prolonged start-up timelines, or site burden. Examples of such patterns include highly restrictive or rarely met eligibility criteria, unusually dense visit schedules, redundant or low-value assessments, extensive manual data collection requirements, or procedural elements that require specialized equipment or uncommon site capabilities. By surfacing these patterns based on observed associations in prior studies (rather than prescriptive recommendations), the AI helps study teams anticipate operational risk and consider potential simplifications early in planning. Importantly, these insights are descriptive and advisory, do not replace scientific, medical, or ethical judgment, and are used upstream of formal protocol finalization and ethics review. When applied in this manner, protocol benchmarking improves feasibility and operational readiness while requiring minimal autonomy and posing little direct impact on participants.

High Autonomy, Low Patient Impact

AI use cases in this quadrant are characterized by a high degree of automation, with systems initiating actions or generating outputs at scale or on a continuous basis, while having limited direct influence on participant safety, eligibility, or the interpretation of clinical outcomes. Despite their relatively low patient impact, the elevated level of autonomy introduces operational and process risk, warranting formal qualification and structured periodic monitoring. Formal qualification in this context includes a documented assessment of the system’s intended use, operating boundaries, and performance expectations; validation of accuracy, completeness, and consistency against defined benchmarks; confirmation of data provenance and integrity; and verification that outputs are traceable, auditable, and appropriately controlled within downstream workflows. Roles and responsibilities for system ownership, review, and issue escalation should be clearly defined, with explicit checkpoints for human verification before outputs enter the clinical or regulatory record.

Ongoing oversight is achieved through periodic monitoring and audits rather than continuous surveillance, reflecting the low patient impact of these applications. This typically includes scheduled performance reviews, sampling-based audits of outputs, monitoring for unexpected shifts in output patterns or volume, and confirmation that performance remains within predefined thresholds. Model updates, retraining, or configuration changes are managed through established change-control processes, with reassessment triggered by material changes, performance degradation, or audit findings. Targeted user training ensures that staff understand system capabilities, limitations, and review obligations, while audit trails and documentation support inspection readiness. Together, these controls provide assurance that highly automated systems operate reliably and predictably at scale, without introducing undue governance burden for low-impact use cases.
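
The sampling-based audit described above can be sketched as follows. The audit fraction, error threshold, and function signature are illustrative assumptions chosen for this example, and the human review step is stubbed as a callable that would in practice be supplied by the quality team.

    import random
    from typing import Callable

    def sampling_audit(output_ids: list[str],
                       review: Callable[[str], bool],  # True if an output passes human review
                       audit_fraction: float = 0.05,
                       error_threshold: float = 0.02,
                       seed: int = 7) -> dict:
        """Audit a random sample of system outputs and compare the observed
        error rate against a predefined escalation threshold."""
        rng = random.Random(seed)
        sample_size = max(1, round(len(output_ids) * audit_fraction))
        sample = rng.sample(output_ids, sample_size)
        failures = [oid for oid in sample if not review(oid)]
        error_rate = len(failures) / sample_size
        return {
            "sampled": sample_size,
            "error_rate": error_rate,
            "escalate": error_rate > error_threshold,  # trigger reassessment via change control
            "failed_ids": failures,                    # retained for the audit trail
        }

    # Example: audit 5% of 400 automated outputs with a stubbed reviewer.
    result = sampling_audit([f"OUT-{i:04d}" for i in range(400)], review=lambda oid: True)
    print(result["sampled"], result["escalate"])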

Example:

A representative high-autonomy, low patient impact application is the use of an agentic AI workflow to autonomously manage clinical data query and quality-assurance processes during trial conduct (i.e., cleaning and correcting trial data before database lock). In this use case, the AI functions as an operational agent operating within predefined objectives and constraints: it continuously monitors incoming data, determines when discrepancies meet action thresholds, generates and assigns queries to sites, prioritizes and escalates them based on timing and response patterns, and adapts subsequent actions as conditions change, all without real-time human direction. Autonomy is high because the system independently plans, executes, and adjusts multi-step operational tasks by default, rather than issuing recommendations for human consideration. Patient impact remains low because these activities affect internal data management workflows rather than participant eligibility, safety oversight, or interpretation of clinical outcomes, and errors are detectable and reversible through established quality controls before database lock. Governance therefore focuses on formal qualification, predefined guardrails, periodic audits, and explicit human override authority, ensuring that agentic operation remains bounded, transparent, and accountable.
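
A deliberately simplified sketch of such a bounded agent appears below. The class name, thresholds, and guardrails are hypothetical; the point is that autonomous action is gated by predefined limits, every action is logged, and a human pause switch overrides the agent entirely.

    from dataclasses import dataclass, field

    @dataclass
    class QueryAgent:
        """Bounded data-query agent: acts autonomously within predefined
        guardrails; all actions are logged and human-overridable."""
        confidence_threshold: float = 0.9    # confidence required before issuing a query
        max_open_queries_per_site: int = 25  # guardrail: cap autonomous activity per site
        paused: bool = False                 # human override: halts all autonomous action
        audit_log: list = field(default_factory=list)

        def handle(self, site: str, discrepancy: dict, open_queries: int) -> str:
            if self.paused:
                return "held: agent paused by human override"
            if discrepancy["confidence"] < self.confidence_threshold:
                return "held: below action threshold, routed to human review"
            if open_queries >= self.max_open_queries_per_site:
                return "held: site query cap reached, escalated to data manager"
            self.audit_log.append((site, discrepancy["field"]))  # traceability
            return f"query issued to {site} for field {discrepancy['field']}"

    agent = QueryAgent()
    print(agent.handle("Site-014", {"field": "VSORRES", "confidence": 0.97}, open_queries=3))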

Low Autonomy, High Patient Impact

AI use cases in this quadrant support human-led decisions but meaningfully influence trial design, participant access, monitoring strategies, or the interpretation of clinical results, creating the potential for significant downstream effects on participant safety, equity, or scientific validity. Although these systems do not act autonomously, their outputs shape high-consequence decisions and therefore require cross-functional review, heightened documentation, and active oversight. Cross-functional review means that evaluation and approval extend beyond a single function and include representation from clinical development, clinical operations, biostatistics or data science, regulatory or medical governance, quality, and, where appropriate, ethics or patient-focused stakeholders. This review should occur at defined decision points, such as initial use-case approval, protocol finalization, major study milestones, and whenever material changes to the model, data inputs, or context of use are introduced, rather than on an ad hoc basis. Heightened documentation goes beyond basic onboarding records and includes clear articulation of the context of use, decision boundaries, underlying assumptions, known limitations, and potential sources of bias; rationale for how AI-informed recommendations were considered or overridden; and evidence of validation against relevant historical or study-specific data.

Oversight in this quadrant refers to sustained, active governance rather than one-time approval, including defined accountability for monitoring performance and relevance over time, review of emerging risks or unintended consequences, and escalation pathways if outputs appear inconsistent with clinical expectations or equity objectives. Together, these measures ensure that AI tools informing high-impact decisions are used transparently, consistently, and responsibly, while preserving human judgment and aligning decision-making with ethical, scientific, and regulatory standards.
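
One way to standardize the heightened-documentation elements listed above is a structured record. The sketch below is a hypothetical Python dataclass whose field names simply mirror the items in the preceding paragraph; the example values are invented for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class ContextOfUseRecord:
        """Heightened-documentation record for a low-autonomy, high-impact tool."""
        tool_name: str
        context_of_use: str                # the specific decision the tool informs
        decision_boundaries: str           # what the tool must not be used to decide
        assumptions: list[str]
        known_limitations: list[str]
        potential_bias_sources: list[str]
        validation_evidence: str           # reference to historical or study-specific validation
        override_rationale: list[str] = field(default_factory=list)  # why outputs were accepted or overridden

    record = ContextOfUseRecord(
        tool_name="safety-monitoring triage model",
        context_of_use="surfacing participants who may warrant closer clinical attention",
        decision_boundaries="does not set monitoring plans or make safety decisions",
        assumptions=["training data representative of the study population"],
        known_limitations=["reduced performance on sparse laboratory histories"],
        potential_bias_sources=["site-level differences in adverse event reporting"],
        validation_evidence="retrospective validation against two completed studies",
    )
    print(record.tool_name)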

Example:

A representative low-autonomy, high patient impact use case is the use of AI to support clinical teams in identifying participants who may require enhanced safety monitoring during an ongoing trial. In this scenario, AI models analyze accumulated trial data, such as adverse event history, laboratory trends, concomitant medications, or prior protocol deviations, to surface participants who may be at elevated risk and warrant closer clinical attention. The AI does not initiate changes to monitoring plans or make safety decisions; instead, it provides decision support to investigators or medical monitors, who determine whether additional assessments, follow-up, or interventions are appropriate. Autonomy is low because all actions remain fully human-led and subject to clinical judgment. Patient impact is high because monitoring intensity directly affects participant safety, burden, and experience, and systematic bias or error in these recommendations could lead to over- or under-monitoring of certain individuals or groups. As a result, this use case requires cross-functional review, careful validation of assumptions, and active oversight to ensure that AI-informed insights are applied equitably and in alignment with patient safety and ethical standards.
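
A deliberately simple sketch of this decision-support pattern follows. The weights, threshold, and input fields are invented for illustration, and the output is only an advisory flag for the medical monitor, never an automatic change to the monitoring plan.

    def flag_for_enhanced_monitoring(participant: dict,
                                     ae_weight: float = 2.0,
                                     lab_weight: float = 1.5,
                                     deviation_weight: float = 1.0,
                                     flag_threshold: float = 4.0) -> dict:
        """Compute an advisory risk score from accumulated trial data.
        The final monitoring decision remains fully human-led."""
        score = (ae_weight * participant["serious_ae_count"]
                 + lab_weight * participant["abnormal_lab_trends"]
                 + deviation_weight * participant["protocol_deviations"])
        return {
            "participant_id": participant["id"],
            "score": score,
            "flagged_for_review": score >= flag_threshold,  # recommendation only
        }

    print(flag_for_enhanced_monitoring(
        {"id": "P-0031", "serious_ae_count": 1, "abnormal_lab_trends": 2, "protocol_deviations": 0}))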

High Autonomy, High Patient Impact

AI use cases in this quadrant are distinguished by both a high degree of automation and a direct, material influence on participant eligibility, safety oversight, or the generation and interpretation of clinical evidence. In this context, independently influencing eligibility or safety means that the system initiates recommendations or actions, such as inclusion or exclusion flags, risk stratification, or monitoring intensity adjustments, that may be applied at scale or in near real time, with limited opportunity for contemporaneous human deliberation before effects occur. Similarly, AI-enabled evidence generation, including approaches that shape control populations, outcome interpretation, or analytical conclusions, is a critical consideration because errors, bias, or model drift can systematically affect trial validity and downstream clinical decision-making, and may be difficult or impossible to reverse once data are generated, analyses are locked, or regulatory submissions are made. These characteristics make such systems highly sensitive to bias, data shifts, and contextual mismatch, particularly across populations, sites, or time.

As a result, deployment in this quadrant requires study-specific qualification, meaning that validation and risk assessment are performed in the context of a particular protocol, population, and intended use, rather than relying solely on enterprise-level or vendor qualification performed for other studies or settings. Independent validation and review should involve functions that are organizationally and operationally separate from the system’s development or day-to-day operation, such as quality assurance, independent data science reviewers, medical governance, or external experts, to ensure objective assessment of performance, fairness, and assumptions. Continuous monitoring is required to detect drift, emerging bias, or unexpected behavior, with predefined thresholds and escalation pathways that trigger human review and intervention. Human-in-the-loop in this context does not imply passive observation, but rather defined authority and responsibility for clinicians or study leaders to pause, override, or modify AI-driven outputs when monitoring signals indicate potential risk, inconsistency, or inequitable impact. Together, these safeguards reflect the compounded risk introduced when high autonomy and high patient impact intersect, ensuring that AI systems enhance rather than compromise participant safety, scientific integrity, and trust in the clinical research enterprise.

Example:

A representative high-autonomy, high patient impact use case is the use of (highly automated) adaptive randomization systems in clinical trials, where predefined algorithms are used to adjust treatment allocation ratios during study conduct based on accumulating data. In this configuration, interim data are ingested and analyzed on a recurring or continuous basis, and updates to randomization probabilities are applied automatically according to prospectively specified rules, without requiring manual recalculation or affirmative approval for each adjustment. Autonomy is high because the system both evaluates incoming data and executes changes to participant assignment by default within pre-established boundaries, with human oversight exercised through monitoring, predefined pause criteria, and override authority rather than step-by-step intervention. Patient impact is high because these adaptations directly influence which treatment participants receive, their exposure to potential risks and benefits, and the evidentiary basis of the trial. Errors, bias, or model drift could therefore affect participant safety, equity, and trial interpretability, and such effects may be difficult or impossible to reverse once participants have been randomized and treated. While many current implementations of adaptive randomization systems retain substantial human mediation through data monitoring committees or statistical review (similar to automated participant risk stratification models presented in a decision-support capacity), this example illustrates a high-autonomy configuration that governance frameworks must be prepared to evaluate as automation increases. As a result, this use case warrants rigorous study-specific qualification, independent statistical and ethical oversight, continuous monitoring of operating characteristics, and clearly defined mechanisms for human intervention.
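
To ground this example, the sketch below implements one simple, prospectively specified adaptation rule: allocation probabilities track each arm's posterior mean response rate under a uniform Beta prior, clamped to prespecified bounds, with a separate predefined pause criterion that halts automatic updates and escalates to the data monitoring committee. The rule, bounds, and pause limit are illustrative assumptions, not a recommended design.

    def update_allocation(successes: dict, failures: dict,
                          floor: float = 0.2, ceiling: float = 0.8) -> dict:
        """Prospectively specified rule: allocate in proportion to each arm's
        posterior mean response rate (Beta(1,1) prior), clamped to [floor, ceiling]."""
        means = {arm: (successes[arm] + 1) / (successes[arm] + failures[arm] + 2)
                 for arm in successes}
        total = sum(means.values())
        probs = {arm: m / total for arm, m in means.items()}
        clamped = {arm: min(max(p, floor), ceiling) for arm, p in probs.items()}
        norm = sum(clamped.values())  # renormalize after clamping
        return {arm: p / norm for arm, p in clamped.items()}

    def pause_criterion(new_probs: dict, last_probs: dict, drift_limit: float = 0.25) -> bool:
        """Predefined pause rule: halt automatic updates and escalate to the DMC
        if any arm's allocation shifts by more than drift_limit in one update."""
        return any(abs(new_probs[a] - last_probs[a]) > drift_limit for a in new_probs)

    last = {"A": 0.5, "B": 0.5}
    new = update_allocation(successes={"A": 12, "B": 7}, failures={"A": 8, "B": 13})
    print(new, "PAUSE" if pause_criterion(new, last) else "continue")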

Governance, Operationalization, and the Path Forward

Translating the Framework into Practice

Translating this framework into practice requires organizations to establish governance structures, qualification processes, and implementation foundations that are calibrated to the level of autonomy and patient impact associated with each AI use case. Rather than applying uniform controls, the framework is designed to support proportional governance, ensuring that low-risk applications can be integrated efficiently while higher-risk uses receive the depth of review and oversight warranted by their potential consequences. The first step is defining clear cross-functional governance mechanisms with representation from clinical operations, data science, regulatory, quality, legal, and ethics functions, supported by explicit lines of accountability for development, validation, deployment, and ongoing oversight. For lower-risk use cases, governance may be exercised through existing operational, IT, or quality pathways. For higher-autonomy or higher-impact applications, organizations should establish dedicated cross-functional review bodies and predefined escalation pathways that clarify when additional scrutiny, independent review, or study-specific approval is required.

Once governance pathways are established, organizations must implement qualification and monitoring procedures that reflect the AI system’s context of use. This includes maintaining clear documentation of each tool’s intended purpose, operating boundaries, assumptions, limitations, and potential risks, as well as evidence supporting its performance in the relevant clinical setting. Performance evaluation should consider representativeness across populations and be supported by monitoring processes that are scaled to risk, ranging from periodic performance checks for low-impact applications to continuous monitoring for systems with high autonomy or direct patient impact. Drift, accuracy, explainability, and equity should be assessed at a frequency commensurate with potential consequence, with defined thresholds for escalation and corrective action. All reviews, exceptions, and interventions should be documented within the quality system to support transparency and audit readiness.
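
As one more illustration, monitoring expectations scaled to risk could be captured as configuration rather than prose. The cadences and checks below are placeholder values chosen for this sketch, not recommended frequencies.

    # Monitoring scaled to quadrant risk (all values illustrative).
    MONITORING_PLAN = {
        ("low autonomy", "low impact"):   {"cadence": "quarterly spot checks",
                                           "drift_check": False, "equity_review": "annual"},
        ("high autonomy", "low impact"):  {"cadence": "monthly sampling audits",
                                           "drift_check": True, "equity_review": "semiannual"},
        ("low autonomy", "high impact"):  {"cadence": "review at defined study milestones",
                                           "drift_check": True, "equity_review": "per milestone"},
        ("high autonomy", "high impact"): {"cadence": "continuous monitoring with thresholds",
                                           "drift_check": True, "equity_review": "continuous"},
    }

    print(MONITORING_PLAN[("high autonomy", "high impact")]["cadence"])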

Finally, successful operationalization depends on strengthening the organizational foundations for responsible AI adoption. This includes building appropriate AI literacy through targeted, role-based training; supporting change management to integrate AI tools into established clinical workflows; standardizing vendor assessment and contracting expectations; and embedding AI governance within broader digital modernization and quality initiatives. Together, these practices enable organizations to apply the framework consistently across diverse use cases, balancing innovation with ethical responsibility, scientific rigor, and regulatory confidence.

Future Directions for Responsible AI in Clinical Research

Looking ahead, the responsible use of AI in clinical research will increasingly depend on governance models that can scale with rising levels of autonomy and patient impact, rather than remaining tied to static, tool-specific controls. As organizations progress from low-autonomy decision-support tools to more adaptive and autonomous systems that directly shape scientific decisions, participant experience, and evidence generation, traditional oversight approaches will become insufficient on their own. Effective governance will therefore require greater emphasis on centralized accountability, standardized context-of-use documentation, and mechanisms for monitoring performance and behavior across studies and over time.

In parallel, the growing adoption of simulation-based approaches, including digital twins, in silico control arms, and advanced modeling techniques, will expand how evidence is generated and interpreted in clinical development. While these methods offer significant potential to improve efficiency and scientific insight, they also introduce new challenges related to validation, bias, interpretability, and irreversibility of impact once evidence is generated or regulatory decisions are informed. Addressing these challenges proactively will require qualification and oversight models that explicitly account for both the autonomy of the system and the potential consequences for participants and future patients.

Regulatory expectations are evolving in the same direction. Agencies such as the FDA, EMA, MHRA, PMDA, and ICH are increasingly emphasizing explainability, auditability, lifecycle management, and continuous oversight for AI-enabled systems, alongside efforts to harmonize expectations across jurisdictions. At the same time, the broader research community will need to move toward shared norms for trustworthy AI, including common risk stratification approaches, documentation standards, and performance benchmarks that enable consistent evaluation across organizations and studies. Collaboration, transparency, and the structured exchange of lessons learned, particularly around high-impact and high-autonomy use cases, will be essential to ensuring that AI advances clinical development in ways that are scientifically robust, ethically sound, and worthy of patient trust.

Conclusion

The accelerating integration of AI into clinical development underscores the need for a structured, risk-based approach to evaluating and governing these technologies responsibly. As AI systems move beyond supporting discrete tasks to shaping decisions, workflows, and evidence at scale, organizations can no longer rely on ad hoc or tool-specific controls. By explicitly considering both the level of autonomy and the degree of patient impact, the framework presented in this paper provides a clear and consistent foundation for calibrating qualification, oversight, and monitoring in proportion to risk across the clinical development lifecycle. The framework provides a repeatable model for decision-making and oversight, requiring reassessment as AI systems, data inputs, and roles in decision-making evolve over time, rather than serving as a one-time risk classification.

Organizations are encouraged to begin with low-risk, high-value applications, such as AI-supported study design, reporting automation, or feasibility prediction, to capture early benefits while building the governance capabilities, operational discipline, and organizational confidence needed to support more complex and higher-impact use cases over time. Importantly, responsible AI adoption is not a one-time exercise, but an ongoing commitment that requires sustained attention to equity, ethics, scientific rigor, and participant trust as systems evolve and contexts change. As AI becomes an increasingly integral component of clinical research, the council remains committed to advancing shared best practices, promoting transparency, and fostering collaboration across the research ecosystem. By grounding innovation in principled, proportionate governance, the clinical research community can ensure that AI enhances (rather than compromises) the integrity of clinical development and delivers meaningful, patient-centered progress.

Authors

Jonathan Helfgott
Former U.S. Food and Drug Administration Official
Johns Hopkins University

Mohammad Hosseini, MA, PhD
Assistant Professor of Bioethics & Health Humanities
Northwestern University Feinberg School of Medicine

Muhammed Idris
Morehouse School of Medicine
Founder, Site-View

Sid Jain
Senior Vice President, Clinical Development & Data Science
Recursion

Raghu Punnamraju
Chief Technology Officer
Velocity Clinical Research

Ravi Ramachandran
Co-Founder & Chief Science Officer
PeerAI

Michel Rider
Global Head, Digital R&D
Sanofi

Laura Russell
Senior Vice President, Product, AI & Innovation
Advarra

Kimberly Tableman
Consultant


The Council for Responsible Use of AI in Clinical Trials is a cross-industry forum of life sciences leaders committed to advancing responsible, practical AI adoption in clinical research. The Council works collaboratively to define governance standards, align on high-impact use cases, and establish measurable outcomes that ensure AI delivers meaningful value to patients and the broader research ecosystem.
