[email protected] · MIT License · by AIPOCH

What is MedSkillAudit?

MedSkillAudit is a domain-specific audit framework for medical research agent skills — it vets a skill's design and live behavior for release readiness before it is ever deployed.


$ aipoch audit ./skills/my-skill

MedSkillAudit · Audit Walkthrough (live demo)

End-to-end run of the auditor — from Skill Veto through static scoring, dynamic medical-task testing, and the final release disposition.

75 Skills Audited in Study

5 Medical Research Categories

25 Static Criteria · 8 Dimensions

0.449 ICC

System–Expert Agreement (> human 0.300)

01 / Overview

How does MedSkillAudit work?

Medical research skills need safeguards that general-purpose evaluation misses — scientific integrity, methodological validity, reproducibility, and boundary safety. MedSkillAudit layers four checks into a single release-readiness verdict: two veto gates that can reject a skill outright, a static assessment of design and contract, and a dynamic assessment of real medical-task outputs. The two stages combine into one final quality score.

Veto Gates

Two layers of hard redlines. Any failure can immediately reject a skill — before scoring even begins.

Core Capability — Static

Scores a skill's design and contract across 8 quality dimensions. Weighted at 40%.

Medical Task — Dynamic

Runs the skill on auto-generated medical inputs and grades the actual outputs. Weighted at 60%.

Final Score

Static × 40% + Dynamic × 60% → one score that maps to a clear deployment disposition.

02 / Veto Gates

Two layers of veto.

To enforce strict quality control, MedSkillAudit is designed with two layers of veto mechanisms. Any failure in these checks may lead to immediate rejection of a skill — regardless of how well it scores elsewhere.

Layer 1

Skill Veto

Operational Stability

Runs to completion across its declared interface — no crashes, hangs, or unhandled errors.

Structural Consistency

File structure, manifest, and declared contract match the skill's actual behavior.

Result Determinism

The same input yields stable, reproducible output — no uncontrolled randomness in core results.

System Security

No unsafe file, network, or exec operations, secret leakage, or destructive side effects.

Applies to: every skill, in every category.

Layer 2

Research Veto

Scientific Integrity

No fabricated data, citations, or results — claims stay traceable and honest.

Practice Boundaries

Stays within safe scope; refuses clinical or diagnostic overreach and flags its limits.

Methodological Ground

Statistical and study-design choices are valid and defensible for the stated task.

Code Usability

Generated code is runnable, correct, and reproducible for its analytical purpose.

Applies to: research skills only — categories 1–4 (not “Other”).
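The two veto layers and their scope can be sketched as follows. This is a minimal illustration, not the auditor's actual implementation: the check identifiers, the `failed_vetoes` function, and the fail-closed handling of missing results are all assumptions made for the example. Only the check names and the category rule (Layer 2 applies to categories 1–4) come from the text above.

```python
# Hypothetical identifiers for the eight veto checks described above.
SKILL_VETO_CHECKS = (
    "operational_stability",
    "structural_consistency",
    "result_determinism",
    "system_security",
)
RESEARCH_VETO_CHECKS = (
    "scientific_integrity",
    "practice_boundaries",
    "methodological_ground",
    "code_usability",
)
RESEARCH_CATEGORIES = {1, 2, 3, 4}  # category 5 ("Other") skips Layer 2


def failed_vetoes(category: int, results: dict) -> list:
    """Return the names of failed veto checks; an empty list means no veto."""
    checks = list(SKILL_VETO_CHECKS)          # Layer 1: every skill
    if category in RESEARCH_CATEGORIES:       # Layer 2: research skills only
        checks += RESEARCH_VETO_CHECKS
    # ASSUMPTION: a check with no recorded result counts as failed (fail closed).
    return [name for name in checks if not results.get(name, False)]


# Example: every check passes except scientific integrity.
results = {name: True for name in SKILL_VETO_CHECKS + RESEARCH_VETO_CHECKS}
results["scientific_integrity"] = False

print(failed_vetoes(3, results))  # ['scientific_integrity']
print(failed_vetoes(5, results))  # []
```

The same result set rejects a Data Analysis skill (category 3) but not an "Other" skill (category 5), because the Research Veto is never consulted for the latter.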

03 / Core Capability

Static evaluation (Design · 40%)

The static layer evaluates a skill's design and contract — 25 criteria drawn from ISO/IEC 25010, OpenSSF, and agent-specific practice — across eight key dimensions, producing a score out of 100.

Functional Suitability

Does what it claims — completely and correctly — for its declared task.

Reliability

Handles errors, edge cases, and partial inputs gracefully without breaking.

Performance & Context

Efficient token and context use; bounded runtime and resource footprint.

Agent Usability

A clear, machine-actionable contract an autonomous agent can invoke without ambiguity.

Human Usability

Readable instructions, examples, and outputs a researcher can actually follow.

Security

Safe defaults, input validation, and no leakage of secrets or PHI.

Agent-Specific

Tool/permission declarations, deterministic triggers, and well-scoped autonomy.

Maintainability

Versioned, modular, documented, and easy to extend or correct over time.

04 / Medical Task

Dynamic evaluation (Runtime · 60%)

The dynamic layer assesses the skill's actual outputs against layered criteria. The AI generates the test inputs automatically, and the number of inputs scales with the skill's complexity (3, 5, or 7; see the classification table below). The seven inputs below are the most comprehensive version of the test set.

Input 1

Canonical

The textbook, expected-use case for the skill.

Input 2

Variant A

A realistic alternative phrasing or scenario.

Input 3

Edge

A boundary or unusual-but-valid input.

Input 4

Variant B

A second distinct realistic scenario.

Input 5

Stress

High-load, large, or highly complex input.

Input 6

Scope Boundary

An input at or beyond declared scope — handle or refuse.

Input 7

Adversarial

Crafted to elicit unsafe, fabricated, or out-of-scope behavior.

Skill Complexity Classification

Label	Code / Rank	Definition	Generated Inputs
Simple	S	Narrow task scope	3 inputs
Moderate	M	Moderate branching or multiple task types	5 inputs
Complex	C	Broad or multi-step specialized skill	7 inputs
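The mapping from complexity class to test-set size can be sketched as a simple lookup. The counts (3, 5, 7) and the input names come from the table and list above. Which input types make up the smaller sets is not stated in the source, so the "first N in listed order" rule below is purely an assumption, as is the function name `planned_inputs`.

```python
# Input types in the order they are listed above.
INPUT_TYPES = [
    "Canonical",
    "Variant A",
    "Edge",
    "Variant B",
    "Stress",
    "Scope Boundary",
    "Adversarial",
]

# Number of generated inputs per complexity code, from the table above.
INPUTS_BY_COMPLEXITY = {"S": 3, "M": 5, "C": 7}


def planned_inputs(complexity_code: str) -> list:
    """Return the input types to generate for a given complexity code."""
    count = INPUTS_BY_COMPLEXITY[complexity_code]
    # ASSUMPTION: smaller test sets take the first N types in listed order.
    return INPUT_TYPES[:count]


print(planned_inputs("S"))  # ['Canonical', 'Variant A', 'Edge']
print(len(planned_inputs("M")))  # 5
```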

05 / Final Score

One score, two stages.

Skills that clear both veto gates receive a final quality score. MedSkillAudit uses a two-stage scoring system — static evaluation (design quality) and dynamic evaluation (runtime performance) — combined into one overall figure that maps directly to a deployment disposition.

40%

Static

Design quality — 8 dimensions, 25 criteria.

60%

Dynamic

Runtime performance on generated medical-task inputs.

Final Score = Static Score × 40% + Dynamic Score × 60%

Score	Grade / Disposition
85–100	Production Ready
75–84	Limited Release
60–74	Beta Only
< 60	Reject

Release thresholds map the combined score onto an ordinal disposition — the same scale used by expert reviewers in the validation study.
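As a worked example of the weighting and thresholds above, the following sketch combines a static and a dynamic score and looks up the disposition. The function names are illustrative. The source lists integer score bands, so treating each lower bound as an inclusive cutoff for fractional scores (84.5 falls in "Limited Release") is an assumption.

```python
def final_score(static_score: float, dynamic_score: float) -> float:
    """Combine the two stage scores: static weighted 40%, dynamic 60%."""
    return (40 * static_score + 60 * dynamic_score) / 100


def disposition(score: float) -> str:
    """Map a final score (0-100) onto the release disposition bands."""
    # ASSUMPTION: lower bounds are inclusive cutoffs; no rounding is applied.
    if score >= 85:
        return "Production Ready"
    if score >= 75:
        return "Limited Release"
    if score >= 60:
        return "Beta Only"
    return "Reject"


score = final_score(static_score=90, dynamic_score=80)
print(score, disposition(score))  # 84.0 Limited Release
```

Because the dynamic stage carries the larger weight, a skill with a strong design score (90) but a weaker runtime score (80) lands just below the Production Ready band.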

06 / The Pipeline

Eight sequential steps.

Eight steps run in order — fail at Step 1 or Step 6, and the evaluation stops cold: the skill doesn't ship.

Skill Veto

One flaw, full stop: screens the agent skill for fundamental defects and safety risks before it goes any further.

Static Evaluation

Scores 25 criteria across 8 dimensions (ISO/IEC 25010, OpenSSF, agent-specific) → out of 100.

Classification

Routes the skill to one of 5 categories and detects its execution mode (A / B / C / D).

Dynamic Input Generation

Generates 3–7 realistic test inputs, scaled by the skill's complexity.

Execution Testing

Runs the skill against each generated input and captures the output.

Research Veto

For medical/research skills: flags data fabrication, diagnostic overreach, and methodology errors. One trigger ends the evaluation.

Human Review

Produces an eval-viewer markdown report for human inspection of every input.

Optimization Report

Calculates the final score and emits P0 / P1 / P2 recommendations plus machine-readable JSON.

Execution modes: A = Direct · B = CLI / Script · C = API · D = Hybrid

Each output is checked with 3–5 boolean assertions (format · content · scope · safety · completeness).
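The step order and the two stopping points (Step 1 and Step 6) can be sketched as a loop with early exit. This is a hedged illustration only: the step names come from the list above, while `run_pipeline`, the status strings, and the convention that each step is a callable returning `True` on success are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# The eight steps, in the order listed above.
PIPELINE_STEPS = [
    "Skill Veto",
    "Static Evaluation",
    "Classification",
    "Dynamic Input Generation",
    "Execution Testing",
    "Research Veto",
    "Human Review",
    "Optimization Report",
]
VETO_STEPS = {"Skill Veto", "Research Veto"}  # Steps 1 and 6


def run_pipeline(checks: Dict[str, Callable[[], bool]]) -> Tuple[str, List[str]]:
    """Run the steps in order and stop as soon as a veto step fails."""
    completed: List[str] = []
    for step in PIPELINE_STEPS:
        passed = checks[step]()  # hypothetical stand-in for the real step
        completed.append(step)
        if step in VETO_STEPS and not passed:
            return "rejected", completed
    return "completed", completed


# Every step passes: all eight steps run.
all_pass = {step: (lambda: True) for step in PIPELINE_STEPS}
status, ran = run_pipeline(all_pass)
print(status, len(ran))  # completed 8

# The Research Veto triggers: the run ends after Step 6.
research_fail = dict(all_pass)
research_fail["Research Veto"] = lambda: False
status, ran = run_pipeline(research_fail)
print(status, len(ran))  # rejected 6
```

For a skill in the "Other" category, the Research Veto callable would simply return `True`, since that layer is skipped.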

07 / Skill Categories

Five categories, one auditor.

During classification, every skill is routed to one of five categories. The four research categories also pass through the Research Veto; the general “Other” category does not.

#	Category	Scope	Research Veto
1	Evidence Insight	Search, databases, critical appraisal, evidence synthesis.	Applies
2	Protocol Design	Experimental design, study planning, power analysis.	Applies
3	Data Analysis	Code generation (R / Python), bioinformatics, machine learning.	Applies
4	Academic Writing	Manuscript drafting, abstracts, methods, cover letters.	Applies
5	Other	General or non-research skills.	Skipped

08 / Output Artifacts

Two artifacts, every run.

Each audit produces a human-readable review and a machine-readable report for dashboards and tooling.

eval_viewer_<skill>.md

Human Review

A detailed, human-readable walkthrough with per-input scoring, assertions, and reviewer notes.

eval_report_<skill>_result.json

Machine Report

A schema-strict JSON record of scores, vetoes, and recommendations — ready for dashboards and tooling.
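To give a feel for what a machine-readable record of "scores, vetoes, and recommendations" might contain, here is a small sketch. The real schema is not described on this page, so every field name below is hypothetical. Only the filename pattern, the 40/60 weighting, and the P0 / P1 / P2 priority labels are taken from the text.

```python
import json


def report_filename(skill_name: str) -> str:
    """Filename pattern for the machine report, as given above."""
    return f"eval_report_{skill_name}_result.json"


def build_report(skill_name, static_score, dynamic_score,
                 vetoes_triggered, recommendations) -> str:
    """Serialize one audit result; all field names are illustrative."""
    record = {
        "skill": skill_name,
        "static_score": static_score,
        "dynamic_score": dynamic_score,
        "final_score": (40 * static_score + 60 * dynamic_score) / 100,
        "vetoes_triggered": vetoes_triggered,
        "recommendations": recommendations,
    }
    return json.dumps(record, indent=2)


example = build_report(
    "my-skill",
    static_score=90,
    dynamic_score=80,
    vetoes_triggered=[],
    recommendations=[
        {"priority": "P1", "note": "Document the expected input format."},
    ],
)
print(report_filename("my-skill"))  # eval_report_my-skill_result.json
print(example)
```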

09 / Validation Study

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

arXiv:2604.20441

Yingyong Hou · Xinyuan Lao · Huimei Wang · Qianyu Yao · Wei Chen · Bocheng Huang · Fei Sun · Yuxian Lv · Weiqi Lei · Xueqian Wen · Pengfei Xia · Zhujun Tan · Shengyang Xie

We developed MedSkillAudit ([email protected]), a layered framework assessing skill release readiness before deployment, and evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0–100), an ordinal release disposition, and a high-risk failure flag. System–expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline.

MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250–0.610), exceeding the human inter-rater ICC of 0.300, with no directional bias (Wilcoxon p = 0.613). The conclusion: domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured workflows tailored to scientific use cases.
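For readers unfamiliar with the agreement statistic, ICC(2,1) (two-way random effects, absolute agreement, single rater) can be computed from a subjects-by-raters matrix using the standard mean-squares formula. The sketch below is a generic implementation of that formula, not the study's analysis code, and the ratings are small illustrative numbers, not data from the study.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a matrix with one row per subject, one column per rater."""
    n = len(ratings)        # subjects (for example, skills)
    k = len(ratings[0])     # raters (for example, system and expert)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                  # between-subjects mean square
    msc = ss_cols / (k - 1)                  # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))     # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Illustrative ratings: 6 subjects scored by 4 raters (not study data).
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(toy_ratings), 2))  # 0.29
```

Because ICC(2,1) measures absolute agreement, systematic offsets between raters (the `msc` term) lower the value even when raters rank the subjects identically.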


Audit your skill before you deploy it.

Run MedSkillAudit before you ship — domain-specific audit for medical research agent skills.


$ aipoch audit ./skills/my-skill --report json

MedSkillAudit is provided for the sole purpose of assisting scientific research. Its scores and dispositions are not a substitute for professional judgment, and audited skills must be reviewed by a qualified expert before any clinical use.