Data Analysis

model-calibration-curve

Use when assessing how well a survival model's predicted probabilities agree with observed outcomes by fitting a Cox model and generating bootstrap calibration curves at one or more prediction horizons from a clinical CSV file. NOT for: nomogram construction, univariate Cox screening, ROC analysis, or decision-curve analysis.

Total Score: 95 / 100
Core Capability: 96 / 100
Functional Suitability: 12 / 12
Reliability: 12 / 12
Performance & Context: 8 / 8
Agent Usability: 15 / 16
Human Usability: 8 / 8
Security: 11 / 12
Maintainability: 10 / 12
Agent-Specific: 20 / 20
Medical Task: 18 / 18 Passed
97 | 1/2/3-year calibration curves for Cox model with age, stage, risk predictors | 5/5
95 | 1/3/5-year calibration curves with 1500 bootstrap replications | 3/3
93 | Clinical CSV with only 25 complete samples and 8 events (below minimum threshold) | 3/3
95 | Calibration curves with custom plot styling: blue/gold colors, 7x6 PDF, custom title | 3/3
92 | 8 predictors, 5000 bootstrap replications, 180-second timeout | 4/4

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed
Operational Stability | PASS | System remains stable across varied inputs and edge cases
Structural Consistency | PASS | Output structure conforms to expected skill contract format
Result Determinism | PASS | Equivalent inputs produce semantically equivalent outputs
System Security | PASS | No prompt injection, data leakage, or unsafe tool use detected
Research Veto: ✅ PASS (Applicable)
Dimension | Result | Detail
Scientific Integrity | PASS | No fabricated calibration statistics, C-index values, or survival data; all values computed from actual data via rms::calibrate() with bootstrap resampling.
Practice Boundaries | PASS | Explicitly not for clinical diagnosis; scoped to model validation only; named alternative skills provided for nomogram, ROC, and DCA tasks.
Methodological Ground | PASS | rms::calibrate() with bootstrap resampling is the standard approach for Cox model calibration assessment; bias-corrected calibration statistics are methodologically correct.
Code Usability | PASS | All 4 R modules syntactically valid; withCallingHandlers/tryCatch pattern correct; bootstrap_log_error fallback for pre-optparse error handling is a good defensive pattern.

Core Capability: 96 / 100 (8 Categories)

Functional Suitability
Full coverage: Cox calibration, bootstrap resampling at multiple horizons, statistics export, PDF visualization, and C-index reporting in a single workflow.
12 / 12
100%
Reliability
Validation for complete cases (>=30 samples, >=10 events), numeric encoding checks, event encoding validation; 6 SKILL_* codes covering all failure modes.
12 / 12
100%
Performance & Context
When-to-read table; references deferred; SKILL.md ~305 lines; single-mode design is appropriately lean for this focused task.
8 / 8
100%
Agent Usability
Clear single-mode operation; required and optional arguments are well distinguished. Design note: unlike sibling skills, no run_record or output_manifest is written; session_info.txt is the only run metadata.
15 / 16
94%
Human Usability
Strong trigger phrases (calibration curve, bootstrap calibration, Cox model calibration); NOT-for list with named alternative skills is exemplary.
8 / 8
100%
Security
No hardcoded secrets; input validation present; missing explicit privacy note for clinical CSV input.
11 / 12
92%
Maintainability
4-module R architecture is leaner than sibling skills (6-9 modules); io and validation logic bundled into functions.R/run_analysis.R; smoke test and bundled data present.
10 / 12
83%
Agent-Specific
Trigger precision excellent; --overwrite guard for idempotency; --timeout_seconds for long bootstrap runs; NOT-for list with named alternatives is best-in-class escape hatch design.
20 / 20
100%
Core Capability Total: 96 / 100
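The category scores can be cross-checked with simple arithmetic. The sketch below uses Python purely for illustration (the skill itself is written in R). It sums the eight earned/maximum pairs listed above and recomputes each percentage. The half-up rounding rule is an assumption inferred from the displayed values, not something the report states.

```python
# Category scores as (earned, maximum), copied from the Core Capability section.
core_scores = {
    "Functional Suitability": (12, 12),
    "Reliability": (12, 12),
    "Performance & Context": (8, 8),
    "Agent Usability": (15, 16),
    "Human Usability": (8, 8),
    "Security": (11, 12),
    "Maintainability": (10, 12),
    "Agent-Specific": (20, 20),
}

earned = sum(e for e, _ in core_scores.values())
maximum = sum(m for _, m in core_scores.values())

# Assumption: percentages are rounded half-up to the nearest whole number.
percentages = {
    name: int(100 * e / m + 0.5) for name, (e, m) in core_scores.items()
}

print(f"Core Capability Total: {earned} / {maximum}")
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

Under that rounding assumption the recomputed figures agree with the reported ones, including the three categories that lost points.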

Medical Task: Execution Average 94.4 / 100; Assertions 18/18 Passed

Scenario | Score | Assertions | Task
Canonical | 97 | 5/5 | 1/2/3-year calibration curves for Cox model with age, stage, risk predictors
Variant A | 95 | 3/3 | 1/3/5-year calibration curves with 1500 bootstrap replications
Edge | 93 | 3/3 | Clinical CSV with only 25 complete samples and 8 events (below minimum threshold)
Variant B | 95 | 3/3 | Calibration curves with custom plot styling: blue/gold colors, 7x6 PDF, custom title
Stress | 92 | 4/4 | 8 predictors, 5000 bootstrap replications, 180-second timeout

Canonical: ✅ Pass (97 / 100)
1/2/3-year calibration curves for Cox model with age, stage, risk predictors

Full pipeline: validate CSV -> complete-case filter -> Cox fit -> rms::calibrate() x3 horizons -> save .qs + .xlsx + PDF + session_info.txt.

Basic 40/40 | Specialized 57/60 | Total 97/100
A1: All four output files generated: calibration_data.qs, calibration_statistics.xlsx, calibration_curve.pdf, session_info.txt
A2: set.seed() applied before bootstrap calibration to ensure reproducibility
A3: calibration_statistics.xlsx contains Time_Point_Stats and Model_Summary sheets
A4: No medical diagnosis or clinical recommendation made
A5: C-index reported in Model_Summary sheet for overall model discrimination
Pass rate: 5 / 5
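
Assertion A2 depends on seeding the random number generator before resampling. As a language-neutral illustration (Python here; the skill itself applies set.seed() in R before bootstrap calibration), the sketch below shows why a fixed seed makes bootstrap output repeatable. `bootstrap_means` is a hypothetical helper written for this example. It resamples a plain list of numbers and is not a reimplementation of rms::calibrate().

```python
import random


def bootstrap_means(values, reps, seed):
    """Return the mean of each bootstrap resample of `values`.

    Hypothetical helper for illustration only: it shows the effect of a
    fixed seed, not the calibration procedure used by the skill.
    """
    rng = random.Random(seed)  # dedicated generator, seeded once up front
    n = len(values)
    means = []
    for _ in range(reps):
        # Draw n observations with replacement.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return means


data = [2.0, 3.5, 1.0, 4.2, 5.1, 2.8, 3.3, 4.9]
first_run = bootstrap_means(data, reps=200, seed=123)
second_run = bootstrap_means(data, reps=200, seed=123)
print(first_run == second_run)
```

Because both runs start from the same seed, they draw the same resamples and produce identical results.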
Variant A: ✅ Pass (95 / 100)
1/3/5-year calibration curves with 1500 bootstrap replications

Custom bootstrap count and horizon years accepted; 3 calibration curves generated with colors from default 5-color palette.

Basic 40/40 | Specialized 55/60 | Total 95/100
A1: --bootstrap_reps 1500 accepted and applied to rms::calibrate() call
A2: 3 calibration curves generated for years 1, 3, 5
A3: Multiple curve colors applied from default color palette
Pass rate: 3 / 3
Edge: ✅ Pass (93 / 100)
Clinical CSV with only 25 complete samples and 8 events — below minimum threshold

SKILL_INVALID_PARAMETER raised: requires >= 30 complete samples and >= 10 events. No partial model saved.

Basic 38/40 | Specialized 55/60 | Total 93/100
A1: SKILL_INVALID_PARAMETER raised when fewer than 30 complete samples remain after filtering
A2: Error message identifies the specific sample and event count requirements
A3: No partial calibration output saved when minimum requirements are not met
Pass rate: 3 / 3
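
The minimum-data gate exercised by this scenario can be sketched as follows. This is a Python illustration under stated assumptions, not the skill's R code. The two thresholds and the SKILL_INVALID_PARAMETER name come from the report. The function name, the exception class, the "event coded as 1" convention, and the definition of a missing value (absent, None, or empty string) are hypothetical.

```python
MIN_COMPLETE_SAMPLES = 30  # threshold stated in the report
MIN_EVENTS = 10            # threshold stated in the report


class SkillInvalidParameter(ValueError):
    """Hypothetical stand-in for the SKILL_INVALID_PARAMETER error code."""


def check_minimum_data(rows, columns, event_column):
    """Keep complete cases and enforce the sample and event minimums.

    rows: list of dicts, one per patient (for example from csv.DictReader).
    columns: every column the model needs, including time and event.
    event_column: name of the event indicator (assumed: 1 means event).
    """
    # Complete-case filter: drop rows with any missing required value.
    complete = [
        r for r in rows
        if all(r.get(c) not in (None, "") for c in columns)
    ]
    events = sum(1 for r in complete if int(float(r[event_column])) == 1)

    # Fail before any model fitting so that no partial output is written.
    if len(complete) < MIN_COMPLETE_SAMPLES or events < MIN_EVENTS:
        raise SkillInvalidParameter(
            f"need >= {MIN_COMPLETE_SAMPLES} complete samples and "
            f">= {MIN_EVENTS} events; got {len(complete)} samples "
            f"and {events} events"
        )
    return complete
```

Raising before the fit is what makes assertion A3 easy to satisfy: nothing has been computed yet, so there is no partial output to clean up.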
Variant B: ✅ Pass (95 / 100)
Calibration curves with custom plot styling: blue/gold colors, 7x6 PDF, custom title

Custom plot dimensions, colors, and title applied; analysis logic unchanged by plot customization parameters.

Basic 40/40 | Specialized 55/60 | Total 95/100
A1: Custom --plot_width 7, --plot_height 6, --colors, and --plot_title applied to PDF output
A2: Custom colors accepted as comma-separated hex string
A3: Analysis logic (Cox fit, bootstrap calibration) unchanged by plot customization
Pass rate: 3 / 3
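
Assertion A2 says colors arrive as one comma-separated hex string. A minimal sketch of parsing such a value is shown below, in Python for illustration only. The `parse_colors` helper, the six-digit `#RRGGBB` pattern, and the example color values are assumptions. The report does not say which hex forms the skill accepts or how it validates them.

```python
import re

# Assumption: only six-digit colors with a leading '#' are accepted.
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def parse_colors(value):
    """Split a comma-separated color string and validate each entry."""
    colors = [part.strip() for part in value.split(",") if part.strip()]
    invalid = [c for c in colors if not HEX_COLOR.match(c)]
    if not colors or invalid:
        raise ValueError(f"invalid color specification: {value!r}")
    return colors


# Illustrative blue/gold pair; the values used in the test run are not given.
print(parse_colors("#1F77B4, #FFD700"))
```

Keeping the parsing separate from the model code is consistent with assertion A3, which requires that plot options leave the analysis untouched.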
Stress: ✅ Pass (92 / 100)
8 predictors, 5000 bootstrap replications, 180-second timeout

8 predictors accepted; 5000 bootstrap reps applied; timeout enforced at 180s. All outputs generated within time limit.

Basic 38/40 | Specialized 54/60 | Total 92/100
A1: 8 predictors accepted for Cox model fitting
A2: 5000 bootstrap replications applied to rms::calibrate()
A3: SKILL_TIMEOUT raised if 180-second limit is exceeded during bootstrap
A4: No fabricated calibration statistics; all values computed from actual bootstrap resampling
Pass rate: 4 / 4
Medical Task Total: 94.4 / 100
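
As a consistency check on the scenario scores, the sketch below (Python, for illustration) rebuilds each scenario total from its Basic and Specialized parts and then averages them. It assumes the execution average is the unweighted mean of the five scenario totals. That assumption reproduces the reported 94.4, but the report does not state the formula.

```python
# (basic, specialized, assertions_passed) per scenario, copied from the report.
scenarios = {
    "Canonical": (40, 57, 5),
    "Variant A": (40, 55, 3),
    "Edge": (38, 55, 3),
    "Variant B": (40, 55, 3),
    "Stress": (38, 54, 4),
}

# Each scenario total is Basic (out of 40) plus Specialized (out of 60).
totals = {name: b + s for name, (b, s, _) in scenarios.items()}

# Assumption: unweighted mean across scenarios.
execution_average = sum(totals.values()) / len(totals)
assertions_passed = sum(a for _, _, a in scenarios.values())

print(totals)
print(round(execution_average, 1), assertions_passed)
```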

Key Strengths

  • rms::calibrate() with bootstrap resampling is the methodologically correct and standard approach for Cox model calibration; the implementation is faithful to the statistical method.
  • Two-worksheet Excel output (Time_Point_Stats + Model_Summary) provides comprehensive model assessment in a single, well-organized file.
  • NOT-for list with named alternative skills (nomogram-construction, roc-diagnostic-performance, decision-curve-analysis) is exemplary escape hatch design that helps users navigate the broader skill collection.
  • bootstrap_log_error fallback for pre-optparse error handling is a good defensive pattern that prevents silent failures when the package check fires before argument parsing.