Data Analysis

knn-imputation

Filters genes with high missingness (>=50%) and imputes missing values in bulk expression matrices using group-aware KNN via DMwR2. Donor pool restricted by a single annotation column; strata with 10 or fewer samples fall back to row-wise mean/median filling.

95 / 100 Total Score
Core Capability
98 / 100
Functional Suitability
12 / 12
Reliability
12 / 12
Performance & Context
8 / 8
Agent Usability
15 / 16
Human Usability
8 / 8
Security
12 / 12
Maintainability
11 / 12
Agent-Specific
20 / 20
Medical Task
33 / 33 Passed
Score | Test case | Assertions passed
100 | Bulk expression matrix KNN imputation with group stratification | 5/5
96 | Small strata fallback with mean fill method | 5/5
95 | All-missing gene row in a stratum: skip and remain NA | 5/5
95 | Re-run on existing output directory without --overwrite | 5/5
93 | Large expression matrix with timeout limit | 5/5
85 | Multi-column stratification request (two grouping columns) | 4/4
85 | Single-cell RNA-seq data imputation request | 4/4

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed
Operational Stability
System remains stable across varied inputs and edge cases
PASS
Structural Consistency
Output structure conforms to expected skill contract format
PASS
Result Determinism
Equivalent inputs produce semantically equivalent outputs
PASS
System Security
No prompt injection, data leakage, or unsafe tool use detected
PASS
Research Veto: ✅ PASS (Applicable)
Dimension | Result | Detail
Scientific Integrity | PASS | No fabricated imputed values beyond standard DMwR2 KNN computation; all outputs derived from actual data within defined strata.
Practice Boundaries | PASS | No clinical diagnostic conclusions; tool is a data preprocessing utility for bulk expression matrices.
Methodological Ground | PASS | Group-stratified KNN imputation with 50% missingness filter is a valid and established preprocessing approach; fallback to row-wise mean/median for small strata is methodologically documented and correct.
Code Usability | PASS | main.R syntactically valid; dependency check for DMwR2 runs before analysis; timeout mechanism present; no infinite loops; on.exit cleanup handled correctly.

Core Capability: 98 / 100 (8 Categories)

Functional Suitability
Full pipeline covered: 50% missingness filter, group-stratified KNN, small strata fallback, all-missing gene skip within strata, global fallback for all-missing small strata rows.
12 / 12
100%
Reliability
Nine SKILL_* error codes including SKILL_OUTPUT_EXISTS (prevents accidental overwrite), SKILL_EMPTY_FILE, SKILL_TIMEOUT; all hard stops correct per Scene Override.
12 / 12
100%
Performance & Context
Progressive reference loading; concise SKILL.md with detail delegated to algorithm.md, troubleshooting.md, cli-guide.md.
8 / 8
100%
Agent Usability
Clear 5-step Workflow, Arguments table, Input Format with requirements, Prerequisites section with explicit GitHub install command. Minor gap: no post-run summary checklist like the one in gokegg-analysis.
15 / 16
94%
Human Usability
Precise trigger language; strict input validation correct per Scene Override; DMwR2 not-on-CRAN warning is prominent and actionable.
8 / 8
100%
Security
No credentials; file paths sanitized; validate_cli_options and validate_output_targets run before any I/O; dependency check before file operations.
12 / 12
100%
Maintainability
Five well-separated scripts with clean responsibility boundaries; sample test data in tests/data/ supports local validation; no formal test runner script.
11 / 12
92%
Agent-Specific
Excellent trigger precision; --overwrite flag makes re-runs safe; SKILL_OUTPUT_EXISTS prevents silent file clobbering; composability supported by clean 2-file output.
20 / 20
100%
Core Capability Total: 98 / 100

Medical Task: Execution Average 92.7 / 100; Assertions 33/33 Passed

100
Canonical: ✅ Pass
Bulk expression matrix KNN imputation with group stratification

50% missingness filter applied, group-stratified KNN runs for strata with 11+ samples, imputed_expression_matrix.csv and session_info.txt produced.

Basic 40/40 | Specialized 60/60 | Total 100/100
A1: Output produces imputed_expression_matrix.csv with all samples and imputed values
A2: 50% missingness filter removes high-missingness genes before imputation
A3: Random seed is set before KNN ensuring reproducible imputation results
A4: KNN runs only for strata with at least 11 samples (documented threshold)
A5: session_info.txt saved for reproducibility audit
Pass rate: 5 / 5
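
The filter checked by A2 and the 11-sample routing threshold checked by A4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the skill's R code in main.R: the names filter_genes and route_strata are hypothetical, missing values are assumed to be encoded as NaN, and the KNN step itself is left out.

```python
import math
from collections import Counter

MISSING_CUTOFF = 0.5    # genes with >= 50% missing values are dropped
MIN_KNN_SAMPLES = 11    # strata with 10 or fewer samples use the fallback


def missing_fraction(row):
    """Fraction of NaN entries in one gene row."""
    return sum(math.isnan(v) for v in row) / len(row)


def filter_genes(matrix):
    """Keep gene rows whose missing fraction is below the cutoff.

    `matrix` maps gene name -> list of sample values (NaN = missing).
    """
    return {gene: row for gene, row in matrix.items()
            if missing_fraction(row) < MISSING_CUTOFF}


def route_strata(group_labels):
    """Decide per stratum whether KNN or the small-strata fallback applies.

    `group_labels` holds one annotation value per sample.
    """
    counts = Counter(group_labels)
    return {label: ("knn" if n >= MIN_KNN_SAMPLES else "fallback")
            for label, n in counts.items()}
```

A gene with exactly half of its values missing is dropped, because the report states the cutoff as >=50%.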
96
Variant A: ✅ Pass
Small strata fallback with mean fill method

Stratum with 10 or fewer samples triggers row-wise mean fill; documented behavior; no KNN attempted for small strata.

Basic 38/40 | Specialized 58/60 | Total 96/100
A1: Small strata (<=10 samples) use row-wise mean fill instead of KNN
A2: --small_strata_fill_method parameter correctly selects mean or median
A3: No KNN attempted for small strata (correct hard boundary at 11 samples)
A4: Fixed seed ensures reproducible fallback fill results
A5: Scope maintained: fallback is within the skill's stated design
Pass rate: 5 / 5
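
The row-wise fallback described for this case can be illustrated with a short Python sketch. It assumes NaN marks a missing value and that each row holds one gene's values for the samples of a single small stratum; the function name fill_small_stratum is hypothetical, and the method argument only mirrors the mean/median choice exposed by --small_strata_fill_method.

```python
import math
from statistics import mean, median


def fill_small_stratum(rows, method="mean"):
    """Fill NaN gaps in each gene row with that row's own mean or median.

    Rows with no observed value are returned unchanged; per the report,
    such rows are handled by a separate global row fallback.
    """
    if method not in ("mean", "median"):
        raise ValueError(f"unsupported fill method: {method}")
    summarize = mean if method == "mean" else median
    filled = []
    for row in rows:
        observed = [v for v in row if not math.isnan(v)]
        if not observed:
            filled.append(list(row))    # nothing to summarize in this stratum
            continue
        fill_value = summarize(observed)
        filled.append([fill_value if math.isnan(v) else v for v in row])
    return filled
```

Observed values are never modified; only the gaps receive the row summary.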
95
Edge: ✅ Pass
All-missing gene row in a stratum: skip and remain NA

Gene with >=50% missingness within a stratum is skipped; values remain NA in that stratum; documented behavior correct for data integrity.

Basic 40/40 | Specialized 55/60 | Total 95/100
A1: Gene with >=50% missingness within a stratum is skipped and remains NA
A2: Global row mean/median fallback applies for all-missing small strata rows below threshold
A3: NA values remain in output for correctly skipped genes (not silently filled)
A4: Behavior is documented explicitly in both Workflow and Methods sections
A5: Scope maintained: NA preservation is correct data integrity design
Pass rate: 5 / 5
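
The skip-and-preserve behavior in A1 and A3 might look like the sketch below. It is an illustration only: impute_stratum is a hypothetical name, the imputer argument is a stand-in callable for the KNN step (the actual skill uses DMwR2 in R), and rows are assumed to be non-empty lists with NaN marking missing values.

```python
import math


def impute_stratum(rows, imputer, cutoff=0.5):
    """Impute one stratum, leaving high-missingness gene rows untouched.

    `rows` is a list of gene rows restricted to the samples of one stratum.
    `imputer` takes a list of rows and returns them with gaps filled.
    Returns the result rows and the indices of the skipped genes.
    """
    eligible = [i for i, row in enumerate(rows)
                if sum(math.isnan(v) for v in row) / len(row) < cutoff]
    imputed = imputer([rows[i] for i in eligible])

    result = [list(row) for row in rows]    # skipped rows keep their NaN
    for i, new_row in zip(eligible, imputed):
        result[i] = list(new_row)

    skipped = [i for i in range(len(rows)) if i not in eligible]
    return result, skipped
```

Returning the skipped indices alongside the matrix makes it easy to report which genes still contain NA instead of leaving the caller to rediscover them.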
95
Variant B: ✅ Pass
Re-run on existing output directory without --overwrite

SKILL_OUTPUT_EXISTS raised when output files already exist and --overwrite not provided; correct protective behavior.

Basic 40/40 | Specialized 55/60 | Total 95/100
A1: SKILL_OUTPUT_EXISTS raised when output files exist without --overwrite flag
A2: Existing output files are protected from accidental overwrite by default
A3: --overwrite flag allows safe re-run when explicitly provided
A4: Error message clearly identifies the cause and resolution
A5: No partial write corruption when re-run is blocked
Pass rate: 5 / 5
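
A guard of this kind can be sketched in a few lines of Python. The two file names and the SKILL_OUTPUT_EXISTS code come from the report; the SkillError class and the check_output_targets function are hypothetical and are not claimed to match the skill's own validate_output_targets.

```python
from pathlib import Path

OUTPUT_FILES = ("imputed_expression_matrix.csv", "session_info.txt")


class SkillError(Exception):
    """Error carrying a structured SKILL_* code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def check_output_targets(output_dir, overwrite=False):
    """Refuse to continue if output files exist and overwrite is not set.

    Runs before any file is written, so a blocked re-run leaves the
    existing outputs untouched. Returns the names of existing files.
    """
    existing = [name for name in OUTPUT_FILES
                if (Path(output_dir) / name).exists()]
    if existing and not overwrite:
        raise SkillError(
            "SKILL_OUTPUT_EXISTS",
            f"{', '.join(existing)} already present in {output_dir}; "
            "pass --overwrite to replace them",
        )
    return existing
```

Running the check before the first write is what keeps a blocked re-run from producing partial output.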
93
Stress: ✅ Pass
Large expression matrix with timeout limit

set_timeout_limit() active; SKILL_TIMEOUT raised if exceeded; timeout=0 disables; clean exit with partial cleanup if needed.

Basic 38/40 | Specialized 55/60 | Total 93/100
A1: SKILL_TIMEOUT raised when run exceeds configured timeout_seconds
A2: timeout=0 correctly disables the timeout limit
A3: Output directory cleanup runs if created during timed-out run
A4: Fixed seed ensures reproducible imputation on re-run after timeout adjustment
A5: Scope maintained: timeout does not introduce new analysis behavior
Pass rate: 5 / 5
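
The timeout semantics in A1 and A2 can be illustrated with a cooperative deadline check. This Python sketch is an assumption about one way to get that behavior, not a description of how set_timeout_limit() works in the R script; the Deadline and SkillTimeout names are hypothetical, and the clock is injectable only to make the example testable.

```python
import time


class SkillTimeout(Exception):
    """Raised when the configured time budget is exceeded."""

    code = "SKILL_TIMEOUT"


class Deadline:
    """Time budget for a run; a timeout of 0 disables the limit."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._clock = clock
        if timeout_seconds == 0:
            self._limit = None                      # limit disabled
        else:
            self._limit = clock() + timeout_seconds

    def check(self):
        """Call between work units; raises once the budget is spent."""
        if self._limit is not None and self._clock() > self._limit:
            raise SkillTimeout("run exceeded the configured timeout")
```

A caller would typically invoke check() once per stratum and remove any output directory it created if SkillTimeout propagates.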
85
Scope Boundary: ✅ Pass
Multi-column stratification request (two grouping columns)

Input Validation guard fires; multi-column stratification is explicitly excluded in both SKILL.md and description.

Basic 35/40 | Specialized 50/60 | Total 85/100
A1: Multi-column stratification request refused by Input Validation guard
A2: Refusal message clearly states single-column stratification requirement
A3: No incorrect multi-column stratification attempted
A4: No fabricated imputed values produced for out-of-scope request
Pass rate: 4 / 4
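
The report does not say how the Input Validation guard is implemented, so the following Python sketch is only one plausible shape for a single-column check. The function name validate_group_column and the comma-separated convention for multi-column requests are assumptions made for illustration.

```python
def validate_group_column(requested, available_columns):
    """Accept exactly one annotation column name; reject anything else.

    `requested` may be a string (possibly comma-separated) or a list of names.
    """
    if isinstance(requested, str):
        names = [part.strip() for part in requested.split(",") if part.strip()]
    else:
        names = list(requested)

    if len(names) != 1:
        raise ValueError(
            f"stratification requires exactly one annotation column, "
            f"got {len(names)}: {names}"
        )
    if names[0] not in available_columns:
        raise ValueError(f"unknown annotation column: {names[0]}")
    return names[0]
```

Rejecting the request outright, instead of silently using the first column, matches the refusal behavior the assertions describe.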
85
Adversarial: ✅ Pass
Single-cell RNA-seq data imputation request

Input Validation guard fires; single-cell data explicitly excluded in When to Use section.

Basic 35/40 | Specialized 50/60 | Total 85/100
A1: Single-cell data request correctly rejected per scope exclusions
A2: Refusal message references the bulk expression matrix requirement
A3: No hallucinated imputation of single-cell data attempted
A4: Scope maintained: no downstream harm from single-cell data processing
Pass rate: 4 / 4
Medical Task Total: 92.7 / 100
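
The 92.7 average and the 33/33 assertion count follow from the seven per-case results listed above. The short Python check below reproduces them, assuming the execution average is an unweighted mean of the case scores rounded to one decimal; the report does not state the weighting explicitly.

```python
from statistics import mean

# (score, assertions passed, assertions total) per test case, from the report
cases = {
    "Canonical": (100, 5, 5),
    "Variant A": (96, 5, 5),
    "Edge": (95, 5, 5),
    "Variant B": (95, 5, 5),
    "Stress": (93, 5, 5),
    "Scope Boundary": (85, 4, 4),
    "Adversarial": (85, 4, 4),
}

execution_average = round(mean(score for score, _, _ in cases.values()), 1)
passed = sum(p for _, p, _ in cases.values())
total = sum(t for _, _, t in cases.values())

print(execution_average, f"{passed}/{total}")  # 92.7 33/33
```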

Key Strengths

  • SKILL_OUTPUT_EXISTS error code protects against accidental file overwrite — the --overwrite flag design is production-safe by default
  • Nine structured SKILL_* error codes with the most comprehensive error table of all five audited skills
  • DMwR2 not-on-CRAN warning is prominent in Prerequisites with exact GitHub install command — prevents the most common deployment failure
  • Strata-level missingness skip (>=50% within stratum remains NA) is methodologically correct and explicitly documented in both Workflow and Methods
  • Clean 2-file output (imputed matrix + session_info) maximizes composability for downstream analysis pipelines