Data Analysis

model-calibration-curve

Use when assessing how well a survival model's predicted probabilities agree with observed outcomes by fitting a Cox model and generating bootstrap calibration curves at one or more prediction horizons from a clinical CSV file. NOT for: nomogram construction, univariate Cox screening, ROC analysis, or decision-curve analysis.

Total Score: 95 / 100

Core Capability

96 / 100

Functional Suitability

12 / 12

Reliability

12 / 12

Performance & Context

8 / 8

Agent Usability

15 / 16

Human Usability

8 / 8

Security

11 / 12

Maintainability

10 / 12

Agent-Specific

20 / 20

Medical Task

18 / 18 Passed

97 / 100: 1/2/3-year calibration curves for Cox model with age, stage, risk predictors

5/5

95 / 100: 1/3/5-year calibration curves with 1500 bootstrap replications

3/3

93 / 100: Clinical CSV with only 25 complete samples and 8 events (below minimum threshold)

3/3

95 / 100: Calibration curves with custom plot styling: blue/gold colors, 7x6 PDF, custom title

3/3

92 / 100: 8 predictors, 5000 bootstrap replications, 180-second timeout

4/4

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed

✓

Operational Stability

System remains stable across varied inputs and edge cases

PASS

✓

Structural Consistency

Output structure conforms to expected skill contract format

PASS

✓

Result Determinism

Equivalent inputs produce semantically equivalent outputs

PASS

✓

System Security

No prompt injection, data leakage, or unsafe tool use detected

PASS

Research Veto: ✅ PASS (Applicable)

Dimension	Result	Detail
Scientific Integrity	PASS	No fabricated calibration statistics, C-index values, or survival data; all values computed from actual data via rms::calibrate() with bootstrap resampling.
Practice Boundaries	PASS	Explicitly not for clinical diagnosis; scoped to model validation only; named alternative skills provided for nomogram, ROC, and DCA tasks.
Methodological Ground	PASS	rms::calibrate() with bootstrap resampling is the standard approach for Cox model calibration assessment; bias-corrected calibration statistics are methodologically correct.
Code Usability	PASS	All 4 R modules syntactically valid; withCallingHandlers/tryCatch pattern correct; bootstrap_log_error fallback for pre-optparse error handling is a good defensive pattern.

Core Capability: 96 / 100 (8 Categories)

Functional Suitability

Full coverage: Cox calibration, bootstrap resampling at multiple horizons, statistics export, PDF visualization, and C-index reporting in a single workflow.

12 / 12

100%

Reliability

Validation for complete cases (>=30 samples, >=10 events), numeric encoding checks, event encoding validation; 6 SKILL_* codes covering all failure modes.

12 / 12

100%

Performance & Context

When-to-read table; references deferred; SKILL.md ~305 lines; single-mode design is appropriately lean for this focused task.

8 / 8

100%

Agent Usability

Clear single-mode operation; required and optional arguments are well distinguished. Design note: unlike sibling skills, there is no run_record/output_manifest, only session_info.txt.

15 / 16

94%

Human Usability

Strong trigger phrases (calibration curve, bootstrap calibration, Cox model calibration); NOT-for list with named alternative skills is exemplary.

8 / 8

100%

Security

No hardcoded secrets; input validation present; missing explicit privacy note for clinical CSV input.

11 / 12

92%

Maintainability

4-module R architecture is leaner than sibling skills (6-9 modules); io and validation logic bundled into functions.R/run_analysis.R; smoke test and bundled data present.

10 / 12

83%

Agent-Specific

Trigger precision excellent; --overwrite guard for idempotency; --timeout_seconds for long bootstrap runs; NOT-for list with named alternatives is best-in-class escape hatch design.

20 / 20

100%

Core Capability Total: 96 / 100

Medical Task: Execution Average 94.4 / 100; Assertions 18/18 Passed


Canonical: ✅ Pass

1/2/3-year calibration curves for Cox model with age, stage, risk predictors

Full pipeline: validate CSV -> complete-case filter -> Cox fit -> rms::calibrate() x3 horizons -> save .qs + .xlsx + PDF + session_info.txt.

Basic 40/40 | Specialized 57/60 | Total 97/100

✅ A1: All four output files generated: calibration_data.qs, calibration_statistics.xlsx, calibration_curve.pdf, session_info.txt

✅ A2: set.seed() applied before bootstrap calibration to ensure reproducibility

✅ A3: calibration_statistics.xlsx contains Time_Point_Stats and Model_Summary sheets

✅ A4: No medical diagnosis or clinical recommendation made

✅ A5: C-index reported in Model_Summary sheet for overall model discrimination

Pass rate: 5 / 5
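The canonical pipeline above (complete-case filter, Cox fit, one rms::calibrate() call per horizon) might look roughly like the sketch below. It is only an illustration under stated assumptions: the data are simulated rather than read from a clinical CSV, the column names (time, status, age, stage, risk), the seed, and the values of B and m are all hypothetical, and the skill's own code is not shown in this report. The model is refit for each horizon because rms::calibrate() expects a cph fit created with surv = TRUE and time.inc equal to the horizon u.

```r
library(survival)
library(rms)

set.seed(42)  # fixed seed so the simulated data and bootstrap resamples are reproducible

# Simulated stand-in for the clinical CSV (hypothetical column names)
n <- 300
d <- data.frame(
  age   = rnorm(n, 60, 10),
  stage = factor(sample(c("I", "II", "III"), n, replace = TRUE)),
  risk  = rnorm(n)
)
lp <- 0.03 * (d$age - 60) + 0.5 * (d$stage == "III") + 0.4 * d$risk
event_time  <- rexp(n, rate = 0.15 * exp(lp))  # time to event, in years
censor_time <- runif(n, 1, 8)                  # administrative censoring
d$time   <- pmin(event_time, censor_time)
d$status <- as.integer(event_time <= censor_time)

# Complete-case filter
d <- d[complete.cases(d), ]

horizons <- c(1, 2, 3)  # prediction horizons in years
cal_list <- lapply(horizons, function(u) {
  # calibrate() needs the design matrix, response, and baseline survival
  # stored in the fit, with time.inc matching the horizon
  fit <- cph(Surv(time, status) ~ age + stage + risk, data = d,
             x = TRUE, y = TRUE, surv = TRUE, time.inc = u)
  # Bootstrap calibration with Kaplan-Meier estimates in groups of about m subjects
  calibrate(fit, cmethod = "KM", method = "boot", u = u, m = 100, B = 50)
})
names(cal_list) <- paste0("year_", horizons)
```

Each element of cal_list can then be passed to plot() to draw one calibration curve; B is kept small here only so the sketch runs quickly.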

Variant A: ✅ Pass

1/3/5-year calibration curves with 1500 bootstrap replications

Custom bootstrap count and horizon years accepted; 3 calibration curves generated with colors from default 5-color palette.

Basic 40/40 | Specialized 55/60 | Total 95/100

✅ A1: --bootstrap_reps 1500 accepted and applied to rms::calibrate() call

✅ A2: 3 calibration curves generated for years 1, 3, 5

✅ A3: Multiple curve colors applied from default color palette

Pass rate: 3 / 3

Edge: ✅ Pass

Clinical CSV with only 25 complete samples and 8 events — below minimum threshold

SKILL_INVALID_PARAMETER raised: requires >= 30 complete samples and >= 10 events. No partial model saved.

Basic 38/40 | Specialized 55/60 | Total 93/100

✅ A1: SKILL_INVALID_PARAMETER raised when fewer than 30 complete samples remain after filtering

✅ A2: Error message identifies the specific sample and event count requirements

✅ A3: No partial calibration output saved when minimum requirements are not met

Pass rate: 3 / 3
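The minimum-size gate exercised by this edge case could be sketched in base R as follows. The helper name check_minimums, its arguments, and the message wording are hypothetical; only the thresholds (at least 30 complete samples and 10 events) and the SKILL_INVALID_PARAMETER code come from the report. The sketch assumes events are coded as 1.

```r
# Hypothetical helper: keep complete cases, then stop with a classed
# condition if too few samples or events remain.
check_minimums <- function(df, time_col, event_col, predictors,
                           min_samples = 30L, min_events = 10L) {
  cols <- c(time_col, event_col, predictors)
  complete <- df[complete.cases(df[, cols, drop = FALSE]), , drop = FALSE]
  n_samples <- nrow(complete)
  n_events  <- sum(complete[[event_col]] == 1)  # assumes events are coded as 1
  if (n_samples < min_samples || n_events < min_events) {
    stop(structure(
      list(
        message = sprintf(
          "Need >= %d complete samples and >= %d events; found %d samples and %d events.",
          min_samples, min_events, n_samples, n_events
        ),
        call = NULL
      ),
      class = c("SKILL_INVALID_PARAMETER", "error", "condition")
    ))
  }
  complete
}

# Toy data matching the edge case: 25 complete rows, 8 events
small <- data.frame(
  time   = seq_len(25),
  status = rep(c(1, 0), times = c(8, 17)),
  age    = 50 + seq_len(25)
)

result <- tryCatch(
  check_minimums(small, "time", "status", "age"),
  SKILL_INVALID_PARAMETER = function(e) conditionMessage(e)
)
```

Because the check runs before any model is fitted, nothing has been written to disk when the condition is raised, which is one simple way to get the "no partial output" behavior.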

Variant B: ✅ Pass

Calibration curves with custom plot styling: blue/gold colors, 7x6 PDF, custom title

Custom plot dimensions, colors, and title applied; analysis logic unchanged by plot customization parameters.

Basic 40/40 | Specialized 55/60 | Total 95/100

✅ A1: Custom --plot_width 7, --plot_height 6, --colors, and --plot_title applied to PDF output

✅ A2: Custom colors accepted as comma-separated hex string

✅ A3: Analysis logic (Cox fit, bootstrap calibration) unchanged by plot customization

Pass rate: 3 / 3
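One way the comma-separated hex string and the 7x6 PDF size could be handled is sketched below in base R. The parse_colors helper, the specific blue and gold hex values, and the plotted points are all illustrative assumptions; the points are placeholders, not calibration estimates.

```r
# Hypothetical parser for a comma-separated hex color string
parse_colors <- function(x) {
  cols <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  bad <- !grepl("^#[0-9A-Fa-f]{6}$", cols)
  if (any(bad)) {
    stop("Invalid hex color(s): ", paste(cols[bad], collapse = ", "))
  }
  cols
}

curve_colors <- parse_colors("#1F77B4, #D4AF37")  # illustrative blue and gold

# Write a 7 x 6 inch PDF with a custom title
out <- tempfile(fileext = ".pdf")
pdf(out, width = 7, height = 6)
plot(c(0, 1), c(0, 1), type = "n",
     xlab = "Predicted survival probability",
     ylab = "Observed survival probability",
     main = "Custom title")
abline(0, 1, lty = 2)  # ideal calibration line
# Placeholder points only, to show how the parsed colors are applied
lines(c(0.2, 0.5, 0.8), c(0.25, 0.45, 0.85), col = curve_colors[1], lwd = 2)
lines(c(0.2, 0.5, 0.8), c(0.15, 0.55, 0.75), col = curve_colors[2], lwd = 2)
invisible(dev.off())
```

Keeping the parsing and drawing separate from the model fit is consistent with assertion A3: styling parameters never touch the Cox fit or the bootstrap.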

Stress: ✅ Pass

8 predictors, 5000 bootstrap replications, 180-second timeout

8 predictors accepted; 5000 bootstrap reps applied; timeout enforced at 180s. All outputs generated within time limit.

Basic 38/40 | Specialized 54/60 | Total 92/100

✅ A1: 8 predictors accepted for Cox model fitting

✅ A2: 5000 bootstrap replications applied to rms::calibrate()

✅ A3: SKILL_TIMEOUT raised if 180-second limit is exceeded during bootstrap

✅ A4: No fabricated calibration statistics; all values computed from actual bootstrap resampling

Pass rate: 4 / 4
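The report does not describe how the timeout is enforced. One plausible base R approach, sketched here, uses setTimeLimit() and converts the resulting error into a classed SKILL_TIMEOUT condition. The with_timeout wrapper is hypothetical, the match on the error text assumes an English locale, and the limit is only checked at points where R checks for interrupts, so long-running compiled code may overrun it.

```r
# Hypothetical wrapper: evaluate expr under an elapsed-time limit and
# re-signal an overrun as a SKILL_TIMEOUT condition.
with_timeout <- function(expr, seconds) {
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always clear the limit, whether expr finished or not
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE), add = TRUE)
  tryCatch(
    expr,  # lazily evaluated here, after the limit is set
    error = function(e) {
      if (grepl("reached elapsed time limit", conditionMessage(e), fixed = TRUE)) {
        stop(structure(
          list(
            message = sprintf("Timed out after %s seconds", format(seconds)),
            call = NULL
          ),
          class = c("SKILL_TIMEOUT", "error", "condition")
        ))
      }
      stop(e)  # any other error is passed through unchanged
    }
  )
}

# A loop that would take about 5 seconds is cut off by a 0.2-second limit
res <- tryCatch(
  with_timeout(for (i in 1:100) Sys.sleep(0.05), seconds = 0.2),
  SKILL_TIMEOUT = function(e) conditionMessage(e)
)
```

In a real run the wrapped expression would be the bootstrap calibration call and the limit would come from --timeout_seconds.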

Medical Task Total: 94.4 / 100

Key Strengths

rms::calibrate() with bootstrap resampling is the methodologically correct and standard approach for Cox model calibration; the implementation is faithful to the statistical method.
Two-worksheet Excel output (Time_Point_Stats + Model_Summary) provides comprehensive model assessment in a single, well-organized file.
NOT-for list with named alternative skills (nomogram-construction, roc-diagnostic-performance, decision-curve-analysis) is exemplary escape hatch design that helps users navigate the broader skill collection.
bootstrap_log_error fallback for pre-optparse error handling is a good defensive pattern that prevents silent failures when the package check fires before argument parsing.
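The defensive pattern noted in the last point, together with the withCallingHandlers/tryCatch combination mentioned under Code Usability, could be sketched as follows. Everything here is a hypothetical stand-in: the report names bootstrap_log_error but does not show its signature, and the codes SKILL_MISSING_DEPENDENCY and SKILL_ERROR are invented for illustration. The logger uses only base R, so it still works if a package check fails before optparse is loaded.

```r
# Hypothetical stand-in for the skill's bootstrap_log_error():
# base R only, so it is usable before any package is loaded.
bootstrap_log_error <- function(code, msg) {
  line <- sprintf("[%s] %s", code, msg)
  message(line)  # message() writes to stderr
  invisible(line)
}

# Check dependencies up front and log a readable error instead of failing silently
require_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  absent <- pkgs[!available]
  if (length(absent) > 0) {
    bootstrap_log_error("SKILL_MISSING_DEPENDENCY",  # hypothetical code
                        paste("Missing packages:", paste(absent, collapse = ", ")))
    return(FALSE)
  }
  TRUE
}

# withCallingHandlers records warnings and resumes; tryCatch stops on errors
run_safely <- function(expr) {
  warnings_seen <- character()
  result <- tryCatch(
    withCallingHandlers(
      expr,
      warning = function(w) {
        warnings_seen <<- c(warnings_seen, conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) {
      bootstrap_log_error("SKILL_ERROR", conditionMessage(e))  # hypothetical code
      NULL
    }
  )
  list(result = result, warnings = warnings_seen)
}

deps_ok <- require_packages("stats")
out <- run_safely({
  warning("convergence looked slow")
  42
})
```

Here out$result is 42 and out$warnings holds the single warning message: the calling handler lets execution continue after a warning, while an error would unwind to the tryCatch handler and return NULL.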