MedSkillAudit is a domain-specific audit framework for medical research agent skills — it vets a skill's design and live behavior for release readiness before it is ever deployed.
aipoch audit ./skills/my-skill
End-to-end run of the auditor, from Skill Veto through static scoring, dynamic medical-task testing, and the final release disposition.
Medical research skills need safeguards that general-purpose evaluation misses — scientific integrity, methodological validity, reproducibility, and boundary safety. MedSkillAudit layers four checks into a single release-readiness verdict: two veto gates that can reject a skill outright, a static assessment of design and contract, and a dynamic assessment of real medical-task outputs. The two stages combine into one final quality score.
Two layers of hard redlines. Any failure can immediately reject a skill — before scoring even begins.
Scores a skill's design and contract across 8 quality dimensions. Weighted at 40%.
Runs the skill on auto-generated medical inputs and grades the actual outputs. Weighted at 60%.
Static × 40% + Dynamic × 60% → one score that maps to a clear deployment disposition.
To enforce strict quality control, MedSkillAudit is designed with two layers of veto mechanisms. Any failure in these checks may lead to immediate rejection of a skill — regardless of how well it scores elsewhere.
Runs to completion across its declared interface — no crashes, hangs, or unhandled errors.
File structure, manifest, and declared contract match the skill's actual behavior.
The same input yields stable, reproducible output — no uncontrolled randomness in core results.
No unsafe file, network, or exec operations, secret leakage, or destructive side effects.
Skill Veto applies to: every skill, in every category.
No fabricated data, citations, or results — claims stay traceable and honest.
Stays within safe scope; refuses clinical or diagnostic overreach and flags its limits.
Statistical and study-design choices are valid and defensible for the stated task.
Generated code is runnable, correct, and reproducible for its analytical purpose.
Research Veto applies to: research skills only, categories 1–4 (not “Other”).
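The gating logic of the two veto layers can be sketched in a few lines of Python. This is a minimal illustration, not MedSkillAudit's implementation: the check names paraphrase the descriptions above, and the function name, the `is_research_skill` flag, and the rule that a missing result counts as a failure are all assumptions made for the example.

```python
# Illustrative labels that paraphrase the eight veto descriptions above.
# They are NOT identifiers taken from MedSkillAudit.
SKILL_VETO_CHECKS = (
    "runs_to_completion",
    "structure_matches_contract",
    "reproducible_output",
    "no_unsafe_operations",
)
RESEARCH_VETO_CHECKS = (
    "scientific_integrity",
    "boundary_safety",
    "methodological_validity",
    "code_usability",
)


def failed_veto_checks(results: dict[str, bool], is_research_skill: bool) -> list[str]:
    """Return the names of failed veto checks; an empty list means no veto fired."""
    required = list(SKILL_VETO_CHECKS)       # first layer: every skill
    if is_research_skill:
        required += RESEARCH_VETO_CHECKS     # second layer: research skills only
    # Assumption: a check with no recorded result is treated as failed.
    return [name for name in required if not results.get(name, False)]


skill_layer_only = {name: True for name in SKILL_VETO_CHECKS}
print(failed_veto_checks(skill_layer_only, is_research_skill=False))  # []
print(failed_veto_checks(skill_layer_only, is_research_skill=True))
```

Returning the list of failed checks, rather than a bare boolean, keeps the reason for a rejection visible to whoever reads the audit.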
The static layer evaluates a skill's design and contract — 25 criteria drawn from ISO/IEC 25010, OpenSSF, and agent-specific practice — across eight key dimensions, producing a score out of 100.
Does what it claims — completely and correctly — for its declared task.
Handles errors, edge cases, and partial inputs gracefully without breaking.
Efficient token and context use; bounded runtime and resource footprint.
A clear, machine-actionable contract an autonomous agent can invoke without ambiguity.
Readable instructions, examples, and outputs a researcher can actually follow.
Safe defaults, input validation, and no leakage of secrets or PHI.
Tool/permission declarations, deterministic triggers, and well-scoped autonomy.
Versioned, modular, documented, and easy to extend or correct over time.
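The text states that 25 criteria across eight dimensions produce a score out of 100, but it does not say how the criteria are rated or weighted. The sketch below therefore assumes the simplest possible scheme: every criterion is rated between 0 and 1 and all criteria count equally. The dimension keys are placeholders, and the function name is hypothetical.

```python
def static_score(criteria_by_dimension: dict[str, list[float]]) -> float:
    """Average all criterion ratings (each in [0, 1]) and scale to 0-100.

    Assumption: equal weight per criterion. The real weighting may differ.
    """
    ratings = [r for group in criteria_by_dimension.values() for r in group]
    if not ratings:
        raise ValueError("no criteria were rated")
    if any(not 0.0 <= r <= 1.0 for r in ratings):
        raise ValueError("each criterion rating must lie in [0, 1]")
    return round(100.0 * sum(ratings) / len(ratings), 1)


# Placeholder layout: 7 dimensions with 3 criteria and 1 with 4, 25 in total.
example = {f"dimension_{i}": [1.0, 1.0, 1.0] for i in range(1, 8)}
example["dimension_8"] = [1.0, 1.0, 0.0, 0.0]
print(static_score(example))  # 23 of 25 criteria fully met -> 92.0
```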
The dynamic layer assesses the skill's actual outputs with layered criteria. The auditor generates the test inputs automatically, and the number of inputs scales up or down with the skill's complexity. The seven inputs below represent the most comprehensive version of the test set.
The textbook, expected-use case for the skill.
A realistic alternative phrasing or scenario.
A boundary or unusual-but-valid input.
A second distinct realistic scenario.
High-load, large, or highly complex input.
An input at or beyond declared scope — handle or refuse.
Crafted to elicit unsafe, fabricated, or out-of-scope behavior.
| Label | Code / Rank | Definition | Generated Inputs |
|---|---|---|---|
| Simple | S | Narrow task scope | 3 inputs |
| Moderate | M | Moderate branching or multiple task types | 5 inputs |
| Complex | C | Broad or multi-step specialized skill | 7 inputs |
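The complexity table maps directly onto a lookup. The sketch below encodes the three input counts from the table and pairs them with the seven input types described earlier. The type labels are paraphrases, and the choice to take the first N types in listed order is an assumption: the text gives the counts but does not say which types a Simple or Moderate skill receives.

```python
# Labels paraphrase the seven input descriptions; they are illustrative only.
INPUT_TYPES = [
    "canonical",        # textbook, expected-use case
    "variant",          # realistic alternative phrasing or scenario
    "edge_case",        # boundary or unusual-but-valid input
    "second_scenario",  # second distinct realistic scenario
    "stress",           # high-load, large, or highly complex input
    "scope_boundary",   # at or beyond declared scope
    "adversarial",      # crafted to elicit unsafe or fabricated behavior
]

# Counts taken from the table: Simple 3, Moderate 5, Complex 7.
INPUTS_PER_COMPLEXITY = {"S": 3, "M": 5, "C": 7}


def plan_inputs(complexity: str) -> list[str]:
    """Return the input types to generate for a complexity code (S, M, or C)."""
    if complexity not in INPUTS_PER_COMPLEXITY:
        raise ValueError(f"unknown complexity code: {complexity!r}")
    # Assumption: use the first N types in the order they are listed.
    return INPUT_TYPES[: INPUTS_PER_COMPLEXITY[complexity]]


for code in ("S", "M", "C"):
    print(code, plan_inputs(code))
```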
Skills that clear both veto gates receive a final quality score. MedSkillAudit uses a two-stage scoring system — static evaluation (design quality) and dynamic evaluation (runtime performance) — combined into one overall figure that maps directly to a deployment disposition.
Static evaluation (40%): design quality across 8 dimensions and 25 criteria.
Dynamic evaluation (60%): runtime performance on generated medical-task inputs.
Release thresholds map the combined score onto an ordinal disposition — the same scale used by expert reviewers in the validation study.
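The weighting can be made concrete with a short calculation. The 40% and 60% weights come from the text; everything else here is an assumption. In particular, the sketch assumes both stage scores are on a 0–100 scale, and the threshold values and tier labels are placeholders invented for the example, since the actual release cut-offs are not stated here.

```python
STATIC_WEIGHT = 0.40   # from the text: static evaluation is weighted at 40%
DYNAMIC_WEIGHT = 0.60  # from the text: dynamic evaluation is weighted at 60%


def final_score(static: float, dynamic: float) -> float:
    """Combine two stage scores (assumed 0-100 each) into one 0-100 score."""
    for name, value in (("static", static), ("dynamic", dynamic)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} score must lie in [0, 100], got {value}")
    return round(STATIC_WEIGHT * static + DYNAMIC_WEIGHT * dynamic, 1)


def disposition(score: float, thresholds: list[tuple[float, str]]) -> str:
    """Return the label of the highest (minimum score, label) pair that is met."""
    for minimum, label in sorted(thresholds, reverse=True):
        if score >= minimum:
            return label
    raise ValueError("no threshold covers this score")


# PLACEHOLDER cut-offs and labels, invented for this example only.
EXAMPLE_THRESHOLDS = [(0.0, "tier 4"), (50.0, "tier 3"), (70.0, "tier 2"), (85.0, "tier 1")]

score = final_score(static=80.0, dynamic=90.0)  # 0.4 * 80 + 0.6 * 90
print(score, disposition(score, EXAMPLE_THRESHOLDS))  # 86.0 tier 1
```

Passing the thresholds in as an argument keeps the invented numbers out of the scoring function itself, so real cut-offs can be substituted without touching the logic.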
Eight steps run in order — fail at Step 1 or Step 6, and the evaluation stops cold: the skill doesn't ship.
Step 1: One flaw, full stop. Screens the agent skill for fundamental defects and safety risks before it goes any further.
Step 2: Scores 25 criteria across 8 dimensions (ISO/IEC 25010, OpenSSF, agent-specific) to produce a score out of 100.
Step 3: Routes the skill to one of 5 categories and detects its execution mode (A / B / C / D).
Step 4: Generates 3–7 realistic test inputs, scaled by the skill's complexity.
Step 5: Runs the skill against each generated input and captures the output.
Step 6: For medical/research skills, flags data fabrication, diagnostic overreach, and methodology errors. One trigger ends the evaluation.
Step 7: Produces an eval-viewer markdown report for human inspection of every input.
Step 8: Calculates the final score and emits P0 / P1 / P2 recommendations plus machine-readable JSON.
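The control flow of the eight steps, with early exit at the two veto points, can be sketched as a simple loop. This is an illustration under stated assumptions rather than the auditor's code: the step callables, the `state` dictionary, and its keys are hypothetical, and only the step order and the stop-at-Step-1-or-6 rule come from the text.

```python
from typing import Callable

Step = Callable[[dict], bool]  # hypothetical signature: returns True to continue

VETO_STEPS = {1, 6}  # from the text: a failure at Step 1 or Step 6 stops the evaluation


def run_pipeline(steps: list[Step], state: dict) -> dict:
    """Run the steps in order; stop as soon as a veto step reports failure."""
    for number, step in enumerate(steps, start=1):
        passed = step(state)
        state["last_step"] = number
        if number in VETO_STEPS and not passed:
            state["vetoed_at"] = number
            break
    return state


def ok(state: dict) -> bool:
    """Stand-in for a step that completes normally."""
    return True


def veto(state: dict) -> bool:
    """Stand-in for a veto gate that fires."""
    return False


clean = run_pipeline([ok] * 8, {})
blocked = run_pipeline([ok] * 5 + [veto] + [ok] * 2, {})
print(clean)    # {'last_step': 8}
print(blocked)  # {'last_step': 6, 'vetoed_at': 6}
```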
During classification, every skill is routed to one of five categories. The four research categories also pass through the Research Veto; the general “Other” category does not.
| # | Category | Scope | Research Veto |
|---|---|---|---|
| 1 | Evidence Insight | Search, databases, critical appraisal, evidence synthesis. | Applies |
| 2 | Protocol Design | Experimental design, study planning, power analysis. | Applies |
| 3 | Data Analysis | Code generation (R / Python), bioinformatics, machine learning. | Applies |
| 4 | Academic Writing | Manuscript drafting, abstracts, methods, cover letters. | Applies |
| 5 | Other | General or non-research skills. | Skipped |
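The category table can be represented as a small enumeration. The numbers and names follow the table; the class and function names are hypothetical, chosen only for this sketch.

```python
from enum import IntEnum


class Category(IntEnum):
    """The five routing categories, numbered as in the table."""
    EVIDENCE_INSIGHT = 1
    PROTOCOL_DESIGN = 2
    DATA_ANALYSIS = 3
    ACADEMIC_WRITING = 4
    OTHER = 5


def research_veto_applies(category: Category) -> bool:
    """Research Veto applies to categories 1-4 and is skipped for Other."""
    return category is not Category.OTHER


for category in Category:
    status = "Applies" if research_veto_applies(category) else "Skipped"
    print(int(category), category.name, status)
```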
Each audit produces a human-readable review and a machine-readable report for dashboards and tooling.
eval_viewer_<skill>.md: A detailed, human-readable walkthrough with per-input scoring, assertions, and reviewer notes.
eval_report_<skill>_result.json: A schema-strict JSON record of scores, vetoes, and recommendations, ready for dashboards and tooling.
Yingyong Hou · Xinyuan Lao · Huimei Wang · Qianyu Yao · Wei Chen · Bocheng Huang · Fei Sun · Yuxian Lv · Weiqi Lei · Xueqian Wen · Pengfei Xia · Zhujun Tan · Shengyang Xie
We developed MedSkillAudit ([email protected]), a layered framework assessing skill release readiness before deployment, and evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0–100), an ordinal release disposition, and a high-risk failure flag. System–expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline.
MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250–0.610), exceeding the human inter-rater ICC of 0.300, with no directional bias (Wilcoxon p = 0.613). The conclusion: domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured workflows tailored to scientific use cases.
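For readers unfamiliar with the two agreement statistics, the sketch below implements their textbook definitions in plain Python: ICC(2,1) in the Shrout and Fleiss sense (two-way random effects, absolute agreement, single rater) and Cohen's kappa with linear disagreement weights. It is not the study's analysis code, the function names are hypothetical, and the numbers in the usage lines are toy data rather than study results.

```python
def icc2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1) for an n-subjects by k-raters table of scores."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)                      # between-subject mean square
    ms_cols = ss_cols / (k - 1)                      # between-rater mean square
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)


def linear_weighted_kappa(a: list[int], b: list[int], n_levels: int) -> float:
    """Cohen's kappa with weights |i - j| / (n_levels - 1); levels are 0..n_levels-1."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(a)
    p_a = [a.count(level) / n for level in range(n_levels)]
    p_b = [b.count(level) / n for level in range(n_levels)]
    # Observed and chance-expected weighted disagreement.
    observed = sum(abs(x - y) for x, y in zip(a, b)) / (n * (n_levels - 1))
    expected = sum(
        abs(i - j) / (n_levels - 1) * p_a[i] * p_b[j]
        for i in range(n_levels)
        for j in range(n_levels)
    )
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use a single level")
    return 1.0 - observed / expected


# Toy data: two raters scoring three items, and two raters assigning 3-level dispositions.
print(icc2_1([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]))          # identical raters
print(linear_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 1], 3))  # one adjacent-level disagreement
```

Linear weights penalize a disagreement in proportion to how many ordinal levels apart the two ratings are, which suits an ordered release disposition better than unweighted kappa.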
Run MedSkillAudit before you ship — domain-specific audit for medical research agent skills.
aipoch audit ./skills/my-skill --report json
MedSkillAudit is provided for the sole purpose of assisting scientific research. Its scores and dispositions are not a substitute for professional judgment, and audited skills must be reviewed by a qualified expert before any clinical use.