MedSkillAudit: Medical Research AI Skill Audit Framework by AIPOCH

What is MedSkillAudit?

MedSkillAudit is a domain-specific audit framework built by AIPOCH to evaluate whether a medical research agent skill is ready for release before deployment. It combines two hard veto gates with a two-layer scoring system: static design evaluation and dynamic task-based testing. The result is a veto-aware quality score, a deployment disposition, and structured review artifacts for human inspection. In a validation study of 75 skills across five skill categories, MedSkillAudit reached an ICC(2,1) of 0.449 against expert consensus scores, compared with a human inter-rater baseline of 0.300. MedSkillAudit is not a replacement for scientific peer review or clinical expert judgment; it is a pre-deployment quality-control layer for agent skills.

Why Does Medical Research Need a Domain-Specific Skill Audit?

Medical research agent skills carry risks that general-purpose software testing often cannot detect, including fabricated citations, unsupported biomedical claims, flawed statistical reasoning, unsafe diagnostic or treatment-adjacent language, and overconfident interpretation of weak evidence. AIPOCH developed MedSkillAudit after observing that existing skill-quality checks usually focus on formatting, runtime behavior, or generic software reliability, while missing domain-specific scientific failures that matter in biomedical research.

Medical research agents are increasingly assembled from modular skills that handle literature screening, statistical analysis, protocol design, and manuscript drafting. AIPOCH CEO Huimei Wang summarized the gap this creates: "AI agents are becoming part of the scientific workflow, yet there is still no equivalent of a quality-control checkpoint for the skills they rely on." In AIPOCH's own validation study, 57.3% of the 75 evaluated skills fell below the Limited Release threshold, which is the kind of failure rate a domain-specific pre-deployment check is designed to surface before a skill reaches a research workflow.

What Does MedSkillAudit Do?

MedSkillAudit accepts a packaged agent skill as input and runs it through an eight-step audit pipeline. The pipeline helps researchers and skill maintainers identify design flaws, scientific integrity risks, unsafe scope expansion, and runtime failures before a skill is released. For each audit, MedSkillAudit produces two structured outputs: a human-readable eval_viewer Markdown report for manual review and a machine-readable eval_report JSON file for dashboards, repositories, or CI-style quality tracking. AIPOCH frames MedSkillAudit as a workflow support tool: it surfaces flags, scores, and evidence for review, but it does not make final scientific or clinical judgments.

MedSkillAudit can be run as the skill-auditor workflow against a packaged SKILL.md-based skill, producing both human-readable and machine-readable audit outputs. The auditor is available in AIPOCH's GitHub repository as the skill-auditor skill package under an MIT license.

How Does MedSkillAudit Work?

Veto Gates

MedSkillAudit's veto gates are two layers of hard checks that can reject a skill outright before any score is calculated, regardless of how well that skill performs elsewhere. The first layer, Skill Veto, applies to every skill in every category. The second layer, Research Veto, applies only to the four research categories and is skipped for general-purpose skills.

Skill Veto checks four dimensions:

Operational Stability
Structural Consistency
Result Determinism
System Security

[Figure: Skill Veto example for the agent skill “Clinical Data Cleaner”]

Research Veto checks four dimensions:

Scientific Integrity
Practice Boundaries
Methodological Ground
Code Usability

[Figure: Research Veto example for the agent skill “Clinical Data Cleaner”]
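The gating logic described above can be sketched in a few lines of Python. This is a minimal illustration, not AIPOCH's implementation: the function name `run_veto_gates`, the `VetoResult` type, the boolean-per-dimension input shape, and the choice to treat a missing check as a failure are all assumptions made for the example. Only the dimension names and the rule that Research Veto is skipped for general-purpose skills come from the text.

```python
from dataclasses import dataclass, field

# Dimension names as listed in the article.
SKILL_VETO_DIMENSIONS = (
    "Operational Stability",
    "Structural Consistency",
    "Result Determinism",
    "System Security",
)
RESEARCH_VETO_DIMENSIONS = (
    "Scientific Integrity",
    "Practice Boundaries",
    "Methodological Ground",
    "Code Usability",
)


@dataclass
class VetoResult:
    """Hypothetical result container for this sketch."""
    passed: bool
    failed_dimensions: list = field(default_factory=list)


def run_veto_gates(checks: dict, is_research_skill: bool) -> VetoResult:
    """Apply Skill Veto to every skill and Research Veto only to research skills.

    `checks` maps a dimension name to True (pass) or False (fail).
    Assumption: a dimension with no recorded result counts as a failure.
    """
    required = list(SKILL_VETO_DIMENSIONS)
    if is_research_skill:
        required += RESEARCH_VETO_DIMENSIONS
    failed = [name for name in required if not checks.get(name, False)]
    return VetoResult(passed=not failed, failed_dimensions=failed)
```

In this sketch, a general-purpose skill is never evaluated against the Research Veto dimensions, so a failing "Scientific Integrity" entry only blocks a skill that is flagged as a research skill.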

Static Evaluation

Static evaluation assesses a skill’s design and contract against key dimensions such as Functional Suitability, Reliability, Performance & Context, Agent Usability, Human Usability, Security, Agent-Specific, and Maintainability.

[Figure: Static Evaluation example for the agent skill “Clinical Data Cleaner”]

Dynamic Evaluation

Dynamic evaluation assesses the actual outputs of a skill against layered criteria.

[Figure: Dynamic Evaluation example for the agent skill “Clinical Data Cleaner”]

For skill testing, the AI generates the test inputs automatically. How many inputs are generated, and in which categories, depends on the complexity of the skill. The seven inputs below represent the most comprehensive version:

Canonical
Variant A
Edge
Variant B
Stress
Scope Boundary
Adversarial

Skill Complexity Classification

Label	Code	Definition	Number of Inputs
Simple	S	Narrow task scope	3 inputs
Moderate	M	Moderate branching or multiple task types	5 inputs
Complex	C	Broad or multi-step specialized skill	7 inputs
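The mapping from complexity class to test inputs can be sketched as a small lookup. The input counts and category names come from the table and list above. The rule for which categories are used at the Simple and Moderate levels is not stated in the article, so taking the first N categories in the listed order is an assumption of this sketch, as is the function name `plan_inputs`.

```python
# Input categories in the order the article lists them.
INPUT_CATEGORIES = [
    "Canonical",
    "Variant A",
    "Edge",
    "Variant B",
    "Stress",
    "Scope Boundary",
    "Adversarial",
]

# Number of inputs per complexity code, from the classification table.
INPUTS_BY_COMPLEXITY = {"S": 3, "M": 5, "C": 7}


def plan_inputs(complexity_code: str) -> list:
    """Return the input categories to generate for a complexity code.

    Assumption: categories are selected in listed order; the article
    only specifies how many inputs each complexity level receives.
    """
    count = INPUTS_BY_COMPLEXITY[complexity_code]
    return INPUT_CATEGORIES[:count]
```

Under this assumption, a Complex skill receives all seven categories, while Simple and Moderate skills receive a prefix of the same list.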

How Is the MedSkillAudit Final Score Calculated?

MedSkillAudit calculates the final numeric score as:

Final Score = Static Score × 40% + Dynamic Score × 60%

The score is then mapped to a four-tier deployment disposition: 85–100 is Production Ready, 75–84 is Limited Release, 60–74 is Beta Only, and below 60 is Reject. The final deployment disposition is veto-aware: if a skill fails a hard veto gate, it can be rejected regardless of the numeric score.
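As a worked illustration of the formula and tiers, the following Python sketch computes the weighted score and maps it to a disposition. The function names are hypothetical. Two behaviors are assumptions of the sketch rather than statements from AIPOCH: a failed veto always yields Reject (the text says a vetoed skill "can be rejected"), and non-integer scores are assigned by lower bound (for example, 84.5 falls in Limited Release).

```python
def final_score(static_score: float, dynamic_score: float) -> float:
    """Final Score = Static Score x 40% + Dynamic Score x 60%."""
    # Integer weights avoid small floating-point errors from 0.4 and 0.6.
    return (static_score * 40 + dynamic_score * 60) / 100


def disposition(score: float, veto_passed: bool = True) -> str:
    """Map a final score to one of the four deployment tiers.

    Assumption: a failed veto gate overrides the numeric score.
    """
    if not veto_passed:
        return "Reject"
    if score >= 85:
        return "Production Ready"
    if score >= 75:
        return "Limited Release"
    if score >= 60:
        return "Beta Only"
    return "Reject"


# Example: static 80, dynamic 70 -> 32 + 42 = 74 -> Beta Only
example = final_score(80, 70)
print(example, disposition(example))
```

Because the dynamic score carries 60% of the weight, a skill with a strong design but weaker observed outputs (80 static, 70 dynamic) lands at 74, one point short of Limited Release.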

[Figure: Final score example for the agent skill “Clinical Data Cleaner”]

You can view evaluation results for selected AIPOCH skills here.

MedSkillAudit Gives Medical Research Skills a Structured Pre-Deployment Check

MedSkillAudit helps researchers and skill maintainers identify design gaps, scientific integrity risks, and runtime failures in a medical research agent skill before that skill is deployed, using two veto gates plus a combined static and dynamic score. The framework's validation study, covering 75 skills across five skill categories, found that its scores agreed with the expert consensus more closely than the expert reviewers agreed with each other, with an ICC(2,1) of 0.449 against a human inter-rater baseline of 0.300.
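For readers unfamiliar with the agreement statistic, ICC(2,1) is the two-way random-effects, absolute-agreement, single-rater intraclass correlation. The sketch below implements the standard mean-squares formula in plain Python. It is not the code used in AIPOCH's validation study, and the ratings in the example are toy values chosen only to make the arithmetic easy to follow.

```python
def icc_2_1(ratings: list) -> float:
    """ICC(2,1) for a table of ratings: one row per subject, one column per rater."""
    n = len(ratings)      # subjects (for example, audited skills)
    k = len(ratings[0])   # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects, raters, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy data: rater 2 is always exactly one point higher than rater 1.
toy = [[1, 2], [2, 3], [3, 4]]
print(round(icc_2_1(toy), 3))  # 0.667
```

The toy case shows why ICC(2,1) is an absolute-agreement measure: the two raters rank the subjects identically, yet the constant one-point offset pulls the coefficient below 1.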

AIPOCH is an open collection and workflow framework for medical research agent skills, designed to support AI-assisted biomedical research workflows across literature review, evidence organization, protocol design, bioinformatics preprocessing, data analysis support, and research writing. The MedSkillAudit auditor itself, along with the broader skill collection, is available on AIPOCH's GitHub.

FAQ

What is MedSkillAudit?

MedSkillAudit is a domain-specific evaluation framework for medical research agent skills. Developed by AIPOCH, it is a layered framework that assesses whether a skill is ready for release before it is deployed.

How does MedSkillAudit's veto gate work?

MedSkillAudit applies two veto layers. Skill Veto (operational stability, structural consistency, determinism, security) applies to every skill, and Research Veto (scientific integrity, practice boundaries, methodology, code usability) applies to the four research categories. A failure in either layer can reject a skill before scoring begins.

How is the MedSkillAudit final score calculated?

The final score is Static Score × 40% plus Dynamic Score × 60%, mapped to a disposition tier: 85–100 is Production Ready, 75–84 is Limited Release, 60–74 is Beta Only, and below 60 is Reject.

Is MedSkillAudit as reliable as a human expert reviewer?

In AIPOCH's validation study of 75 skills, MedSkillAudit reached an ICC(2,1) of 0.449 against expert consensus scores, compared with a human inter-rater ICC of 0.300, and showed no statistically significant directional bias (Wilcoxon p = 0.613).

How many test inputs does MedSkillAudit generate for dynamic testing?

The number scales with declared skill complexity: 3 inputs for Simple skills, 5 for Moderate skills, and up to 7 for Complex skills, drawn from categories including Canonical, Variant A, Edge, Variant B, Stress, Scope Boundary, and Adversarial.