GSVA Analysis And Visualization

AIPOCH

Use this skill to run GSVA or ssGSEA pathway-level differential analysis from a bulk expression matrix and a sample group file, then generate a heatmap from the saved GSVA result object. Trigger keywords: GSVA, ssGSEA, pathway enrichment, KEGG pathway analysis, MSigDB. NOT for: gene-level differential expression, single-cell analysis, methylation analysis, clinical diagnosis.

FILES
gsva-analysis-and-visualization/
  skill.md
  scripts/
    cli_options.R
    functions.R
    io.R
    main.R
    plot_helpers.R
    recording.R
    run_analysis.R
    utils.R
    visualization.R
  references/
    algorithm.md
    cli-guide.md
    troubleshooting.md
Total Score: 95 / 100

Core Capability: 100 / 100

| Dimension | Score |
| --- | --- |
| Functional Suitability | 12 / 12 |
| Reliability | 12 / 12 |
| Performance & Context | 8 / 8 |
| Agent Usability | 16 / 16 |
| Human Usability | 8 / 8 |
| Security | 12 / 12 |
| Maintainability | 12 / 12 |
| Agent-Specific | 20 / 20 |

Medical Task: 28 / 29 Passed

| Score | Scenario | Checks Passed |
| --- | --- | --- |
| 97 | Full GSVA pipeline: bulk matrix vs Tumor/Healthy with KEGG gene sets | 5/5 |
| 94 | ssGSEA with MSigDB Hallmarks (H category), top 30 pathways | 4/4 |
| 90 | GSVA with 3 samples per group (below recommended minimum) | 3/4 |
| 96 | Visualization reuse from saved GSVA_list.rda with custom heatmap parameters | 4/4 |
| 94 | Full GSVA with C2 REACTOME collection, FDR=0.01, top 50 pathways | 4/4 |
| 83 | Single-cell RNA-seq pathway analysis request (out of scope) | 4/4 |
| 91 | Sample names in group file do not match expression matrix columns | 4/4 |

SKILL.md

GSVA Analysis And Visualization

When to Use

Use this skill when the user wants one of the following:

  • Pathway-level GSVA or ssGSEA analysis from a bulk expression matrix plus a sample group file
  • Case-vs-control or treatment-vs-control pathway enrichment comparison using GSVA plus limma
  • KEGG or MSigDB pathway analysis for bulk RNA-seq or microarray-like expression data
  • Heatmap generation from an existing data/GSVA_list.rda result object
  • A reproducible CLI-backed GSVA workflow with saved tables, an .rda object, and a PDF heatmap

Typical request patterns:

  • "Run GSVA on my bulk RNA-seq matrix and compare case vs control"
  • "Use ssGSEA to score pathways and save the pathway differential results"
  • "Generate a KEGG pathway heatmap from my saved GSVA result"
  • "Do pathway enrichment with GSVA for these grouped bulk samples"

Execution Model

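This is a hybrid skill: SKILL.md defines the scope and instructions, and the R scripts under scripts/ perform the computation.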

  1. Use SKILL.md to verify that the request is in scope.
  2. Use scripts/main.R for real execution.
  3. Use --mode analyze to compute pathway scores and differential results.
  4. Use --mode visualize to reuse an existing data/GSVA_list.rda and generate a heatmap. In visualize mode, GSVA_list.rda must exist in output_dir/data/; run analyze or full mode first if it is missing (SKILL_FILE_NOT_FOUND will be raised otherwise).
  5. Use --mode full to run analysis and visualization in one pass.
  6. Read reference files only when you need algorithm details, troubleshooting, or additional CLI examples.
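
The mode choice in steps 3 to 5 depends on whether a saved result object already exists. A minimal sketch of that decision, assuming a hypothetical helper named pick_mode (it is not part of the skill's scripts) and the data/GSVA_list.rda location described above:

```shell
# Hypothetical helper (not shipped with the skill): pick a run mode
# depending on whether a saved GSVA result already exists in output_dir.
pick_mode() {
  out_dir="$1"
  if [ -f "$out_dir/data/GSVA_list.rda" ]; then
    echo "visualize"   # saved object found, the heatmap can be regenerated directly
  else
    echo "full"        # no saved object, so the analysis has to run first
  fi
}
```

For example, `Rscript scripts/main.R --mode "$(pick_mode ./output)" ...` would fall back to full mode when no saved object is present; full mode still needs the input, group, and case/control arguments.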

When to Read External Files

| Situation | File to Read | Purpose |
| --- | --- | --- |
| Need algorithm details | references/algorithm.md | Understand GSVA, limma, and heatmap generation logic |
| Need to run analysis or plotting | scripts/main.R | Execute the CLI entry point |
| Encounter errors | references/troubleshooting.md | Find standard error codes and fixes |
| Need more CLI examples or the baseline execution record | references/cli-guide.md | Copy ready-to-run commands and review the recorded test run |
| Need sample input files | tests/data/ | Use the bundled demo matrix and group file |

When Not to Use

  • Gene-level differential expression: use differential-expression-analysis instead
  • Single-cell RNA-seq clustering or communication analysis: use sc-clustering or cellchat
  • Immune infiltration scoring rather than pathway enrichment: use ssgsea-r or ssgsea_immune
  • Clinical diagnosis, treatment selection, or patient-specific interpretation: do not use this skill; ask for a validated clinical workflow or human expert review

If the request falls outside these boundaries, stop and tell the user that this skill only covers bulk expression pathway-level GSVA/ssGSEA analysis plus downstream heatmap visualization.

Method Selection Guide

Choose --method based on your data characteristics:

  • gsva: kernel-based enrichment scores suitable for continuous expression data with moderate-to-large sample sizes (≥ 10 samples per group recommended).
  • ssgsea: rank-based enrichment scores; less sensitive to outliers and more suitable for noisy data or smaller sample sizes.

For detailed methodological comparison, READ: references/algorithm.md
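
The sample-size guidance above can be turned into a quick pre-check. This is only a rough heuristic sketch, not the skill's own logic: the helper name suggest_method is hypothetical, the group file is assumed to be a comma-separated file with the group label in the second column, and the threshold of 10 is the recommendation quoted above.

```shell
# Hypothetical helper: suggest a --method value from the smallest group size.
# Assumes a CSV group file with a header row and the group label in column 2.
suggest_method() {
  awk -F',' '
    NR > 1 && NF >= 2 { n[$2]++ }          # count samples per group label
    END {
      min = -1
      for (g in n) if (min < 0 || n[g] < min) min = n[g]
      if (min >= 10) print "gsva"          # every group meets the recommended size
      else print "ssgsea"                  # small (or empty) groups
    }
  ' "$1"
}
```

Running `suggest_method tests/data/group.csv` prints one of the two method names; noisy data may still favor ssgsea regardless of group size.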

Usage

Rscript scripts/main.R \
  --mode full \
  --input_file tests/data/expr_matrix.csv \
  --group_file tests/data/group.csv \
  --case_group Tumor \
  --control_group Healthy \
  --species "Homo sapiens" \
  --category C2 \
  --subcategory KEGG \
  --output_dir ./output \
  --seed 42

Arguments

| Short | Long | Type | Default | Description |
| --- | --- | --- | --- | --- |
| -m | --mode | character | analyze | Run mode: analyze, visualize, or full |
| -i | --input_file | character | required for analyze/full | Expression matrix file (CSV or TSV, genes as rows, samples as columns) |
| -g | --group_file | character | required for analyze/full | Sample group file (CSV or TSV with sample and group columns) |
| -a | --case_group | character | required for analyze/full | Case or treatment group label |
| -c | --control_group | character | required for analyze/full | Control group label |
| -o | --output_dir | character | ./output/ | Output directory |
| -s | --species | character | Homo sapiens | MSigDB species |
| -C | --category | character | C2 | MSigDB category |
| -S | --subcategory | character | KEGG | MSigDB subcategory |
| | --method | character | gsva | GSVA method: gsva or ssgsea (see Method Selection Guide above) |
| | --kcdf | character | Gaussian | GSVA kernel: Gaussian, Poisson, or none |
| | --min_sz | integer | 2 | Minimum gene set size |
| | --max_sz | integer | 10000 | Maximum gene set size |
| | --parallel_sz | integer | 1 | Parallel worker count passed to GSVA |
| | --mx_diff | logical | TRUE | GSVA mx.diff flag |
| | --tau | double | 1 | GSVA tau value |
| | --fdr_threshold | double | 0.05 | FDR threshold used to select top pathways |
| | --top_n | integer | 20 | Number of pathways exported to the top score matrix |
| | --seed | integer | 42 | Random seed |
| | --timeout_seconds | integer | 0 | Optional timeout in seconds; 0 disables it |
| | --plot_file | character | GSVA_heatmap.pdf | Heatmap file name under plot/ (file name only; no path separators) |
| | --plot_title | character | GSVA Enrichment Heatmap | Heatmap title |
| | --width | double | 14 | Heatmap width in inches |
| | --height | double | 8 | Heatmap height in inches |
| | --colors | character | #91bfdb,#ffffbf,#fc8d59 | Comma-separated heatmap colors |
| | --scale | character | none | Heatmap scale mode: none, row, or column |
| | --cluster_rows | logical | TRUE | Cluster heatmap rows |
| | --cluster_cols | logical | FALSE | Cluster heatmap columns |
| | --show_rownames | logical | TRUE | Show pathway names on the heatmap |
| | --show_colnames | logical | FALSE | Show sample names on the heatmap |
| | --fontsize | double | 10 | Base heatmap font size |
| | --fontsize_row | double | 8 | Row label font size |
| | --fontsize_col | double | 9 | Column label font size |
| | --legend_cex | double | 1 | Legend text scaling factor |
| | --top_up | integer | optional | Number of up-regulated pathways retained for plotting |
| | --top_down | integer | optional | Number of down-regulated pathways retained for plotting |
| | --top_mode | character | both | Heatmap subset mode: both, up, down, or total |
| | --sort_by | character | FDR | Pathway ranking: FDR, absLFC, or LFC |
| | --append_stats | logical | FALSE | Append FDR and logFC to heatmap labels |
| | --label_max_chars | integer | 80 | Maximum heatmap label length |
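
The --plot_file restriction (file name only, no path separators) can be checked before calling the script. A small sketch, assuming a hypothetical helper named valid_plot_file; the skill's own validation may differ in detail:

```shell
# Hypothetical helper: succeed only when the value is a bare file name.
# Rejects empty values and anything containing "/" or "\".
valid_plot_file() {
  case "$1" in
    ""|*/*|*\\*) return 1 ;;   # empty, or contains a path separator
    *)           return 0 ;;
  esac
}
```

For example, `valid_plot_file custom_heatmap.pdf` succeeds, while `valid_plot_file plots/custom_heatmap.pdf` fails.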

Input Format

Expression Matrix

  • CSV or TSV file
  • First column contains gene identifiers
  • Remaining columns are sample names
  • Values must be numeric and contain no missing values

Example:

gene,S1,S2,S3,S4
TP53,8.1,7.9,6.5,6.3
EGFR,5.2,5.0,4.2,4.1
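
The matrix rules above (numeric values, no missing entries) can be pre-checked from the shell. This is a minimal sketch rather than the skill's own validator: the helper name check_matrix is hypothetical, and it assumes a comma-separated file with Unix line endings and no quoted fields.

```shell
# Hypothetical helper: report rows with a wrong field count or with
# non-numeric / missing values in the sample columns (column 2 onward).
check_matrix() {
  awk -F',' '
    NR == 1 { ncol = NF; next }            # header row defines the column count
    NF != ncol {
      printf "line %d: expected %d fields, found %d\n", NR, ncol, NF
      bad = 1
      next
    }
    {
      for (i = 2; i <= NF; i++)
        if ($i !~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/) {
          printf "line %d, column %d: non-numeric or missing value \"%s\"\n", NR, i, $i
          bad = 1
        }
    }
    END { exit bad + 0 }                   # non-zero exit status when any problem was found
  ' "$1"
}
```

`check_matrix ./expression_matrix.csv && echo "matrix looks clean"` prints the message only when every value parses as a number; entries such as NA or empty cells are reported with their line and column.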

The bundled tests/data/expr_matrix.csv is derived from the public GEO series GSE44076 after probe-to-gene collapsing and contains the Tumor versus Healthy subset.

Group File

  • CSV or TSV file with a header row
  • One sample column: sample, sample_name, or sample_id
  • One group column: group, condition, cluster, or class
  • Sample names must match the expression matrix columns

Example:

sample,group
GSM1077746,Tumor
GSM1077747,Tumor
GSM1077598,Healthy
GSM1077599,Healthy
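
Mismatched sample names can be spotted before running the analysis. The sketch below assumes a hypothetical helper named check_samples, comma-separated files, and the sample identifier in the first column of the group file. It reports differences in both directions; whether the skill tolerates extra matrix columns is not stated here, so treat the second kind of message as a warning.

```shell
# Hypothetical helper: compare matrix column names with group-file sample names.
# $1: expression matrix CSV, $2: group CSV (sample id assumed in column 1).
check_samples() {
  awk -F',' '
    FNR == NR {                            # first file: the expression matrix
      if (FNR == 1) {
        for (i = 2; i <= NF; i++) in_expr[$i] = 1
      }
      next
    }
    FNR > 1 && $1 != "" { in_group[$1] = 1 }   # second file: skip its header row
    END {
      for (s in in_group) if (!(s in in_expr)) { print "not in matrix: " s; bad = 1 }
      for (s in in_expr) if (!(s in in_group)) { print "not in group file: " s; bad = 1 }
      exit bad + 0
    }
  ' "$1" "$2"
}
```

`check_samples ./expression_matrix.csv ./group_info.csv` exits with status 0 when both sets of names agree and lists the offending names otherwise.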

Output Files

| File | Description |
| --- | --- |
| table/GSVA_diff.csv | limma differential pathway results with logFC, P.Value, and adj.P.Val |
| table/GSVA_enrichment_results.csv | Full GSVA score matrix |
| table/GSVA_enrichment_results_topN.csv | Top pathway score matrix selected by --top_n and --fdr_threshold |
| data/GSVA_list.rda | Saved gsva_result object for downstream visualization |
| plot/GSVA_heatmap.pdf | Heatmap PDF generated in visualize or full mode |
| session_info.txt | R session and package version information |
| output_manifest.txt | Append-only manifest of generated outputs across runs in the same output_dir |
| run_record.txt | Append-only run log with parameters, runtime, and output summaries across runs in the same output_dir |

table/GSVA_diff.csv

| Column | Type | Description |
| --- | --- | --- |
| logFC | numeric | limma-estimated pathway score difference between case and control |
| AveExpr | numeric | Average pathway score across all samples |
| t | numeric | Moderated t statistic from limma |
| P.Value | numeric | Raw p-value from limma |
| adj.P.Val | numeric | Benjamini-Hochberg adjusted p-value |
| B | numeric | Log-odds that the pathway is differentially enriched |
| geneset | character | Pathway identifier used in the GSVA run |
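
As an illustration of how these columns can be used downstream, the sketch below lists pathways whose adj.P.Val falls under a cutoff. The helper name sig_pathways is hypothetical, columns are located by header name, and the naive comma splitting assumes that no field contains an embedded comma.

```shell
# Hypothetical helper: print geneset, logFC, adj.P.Val for significant pathways.
# $1: path to GSVA_diff.csv, $2: FDR cutoff (defaults to 0.05).
sig_pathways() {
  awk -F',' -v cut="${2:-0.05}" '
    { gsub(/"/, "") }                        # drop the quotes R may add around strings
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i  # map header names to column positions
      next
    }
    ($(col["adj.P.Val"]) + 0) < (cut + 0) {
      print $(col["geneset"]) "," $(col["logFC"]) "," $(col["adj.P.Val"])
    }
  ' "$1"
}
```

For example, `sig_pathways ./output/table/GSVA_diff.csv 0.01` applies a stricter cutoff than the default.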

Workflow

Step 1: Validate Input

  • Check that the expression matrix and group file exist
  • Validate supported columns and matching sample names
  • Validate CLI ranges and mode-specific required parameters

Step 2: Run Pathway Analysis

  • Load MSigDB gene sets for the requested species and collection
  • Compute GSVA or ssGSEA scores for each sample
  • Fit a limma model for the case-vs-control pathway comparison

Step 3: Generate Output

  • Save the full score matrix, top pathway subset, and differential results to table/
  • Save the reusable gsva_result object to data/GSVA_list.rda
  • Generate the heatmap PDF in plot/ when running visualize or full
  • Append a new section to output_manifest.txt and run_record.txt for each invocation so earlier provenance is preserved when reusing one output_dir

Examples

Basic Usage

Rscript scripts/main.R \
  --mode full \
  --input_file ./expression_matrix.csv \
  --group_file ./group_info.csv \
  --case_group treatment \
  --control_group control \
  --output_dir ./output

With ssGSEA and Custom Parameters

Rscript scripts/main.R \
  --mode analyze \
  --input_file ./expression_matrix.csv \
  --group_file ./group_info.csv \
  --case_group treatment \
  --control_group control \
  --method ssgsea \
  --top_n 30 \
  --fdr_threshold 0.1 \
  --output_dir ./ssgsea_output \
  --seed 123

Reuse a Saved Result Object

Rscript scripts/main.R \
  --mode visualize \
  --output_dir ./output \
  --plot_file custom_heatmap.pdf \
  --top_up 10 \
  --top_down 10 \
  --top_mode both

For the bundled real-data baseline record, READ: references/cli-guide.md

Error Handling

| Error Code | Meaning | Solution |
| --- | --- | --- |
| SKILL_FILE_NOT_FOUND | Input file or saved result file is missing (in visualize mode, GSVA_list.rda must exist in output_dir/data/) | Check the path and rerun with the correct file; for visualize mode, run analyze or full mode first |
| SKILL_MISSING_COLUMNS | Group file lacks a valid sample or group column | Rename the columns to a supported name |
| SKILL_SAMPLE_MISMATCH | Sample names do not match between files | Align sample names before running the skill |
| SKILL_EMPTY_DATA | Input matrix, gene set query, or plotting matrix is empty | Verify the input matrix and MSigDB settings |
| SKILL_INVALID_PARAMETER | A CLI argument is missing or out of range | Review the parameter table and rerun |
| SKILL_PACKAGE_NOT_FOUND | Required R packages are not installed | Install the missing packages listed in references/cli-guide.md |

If the error persists, READ: references/troubleshooting.md
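
When the script output is captured to a log file, the error code can be pulled out and mapped to the fixes in the table. This is a sketch under one assumption: that the SKILL_* code appears literally in the captured output. The exact message format is not documented here, and the helper name skill_error_hint is hypothetical.

```shell
# Hypothetical helper: find the first SKILL_* code in a log and print a hint.
skill_error_hint() {
  # grep exits non-zero when nothing matches, hence the "|| true"
  code=$(grep -o 'SKILL_[A-Z_]*' "$1" | head -n 1 || true)
  case "$code" in
    SKILL_FILE_NOT_FOUND)    echo "$code: check the path; for visualize, run analyze or full first" ;;
    SKILL_MISSING_COLUMNS)   echo "$code: rename the group file columns to a supported name" ;;
    SKILL_SAMPLE_MISMATCH)   echo "$code: align sample names between the two files" ;;
    SKILL_EMPTY_DATA)        echo "$code: verify the input matrix and MSigDB settings" ;;
    SKILL_INVALID_PARAMETER) echo "$code: review the parameter table" ;;
    SKILL_PACKAGE_NOT_FOUND) echo "$code: install the missing R packages" ;;
    "")                      echo "no SKILL_* code found" ;;
    *)                       echo "$code: see references/troubleshooting.md" ;;
  esac
}
```

A typical use would be to run the script with `2> run.log` and then call `skill_error_hint run.log`.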

Input Validation

This skill accepts:

  • A bulk expression matrix file in CSV or TSV format with genes as rows and samples as columns
  • A sample group file with one supported sample column and one supported group column
  • A valid case/control comparison for pathway-level GSVA or ssGSEA analysis
  • Optional heatmap customization parameters for visualization of a saved GSVA_list.rda

Privacy and data-handling note:

  • If your matrix or group file can be linked to patients or protected records, anonymize it before use
  • This workflow writes result tables, a saved R object, plots, and session metadata to the local output_dir
  • Review local output retention practices before using sensitive material

If the user's request does not involve bulk expression pathway enrichment analysis or GSVA heatmap generation — for example, asking for single-cell analysis, gene-level DE testing, methylation analysis, or clinical diagnosis — do not proceed with this workflow. Instead respond:

"gsva-analysis-and-visualization is designed for bulk expression pathway-level GSVA/ssGSEA analysis and saved-result heatmap visualization. Your request appears to be outside this scope. Please provide a bulk expression matrix plus sample group file for GSVA/ssGSEA analysis, or use a more appropriate skill for your task."

Testing

Rscript scripts/main.R --help

Rscript tests/run_tests.R

Rscript scripts/main.R \
  --mode full \
  --input_file tests/data/expr_matrix.csv \
  --group_file tests/data/group.csv \
  --case_group Tumor \
  --control_group Healthy \
  --species "Homo sapiens" \
  --category C2 \
  --subcategory KEGG \
  --output_dir tests/output \
  --seed 42

Expected outputs:

  • tests/output/table/GSVA_diff.csv
  • tests/output/table/GSVA_enrichment_results.csv
  • tests/output/table/GSVA_enrichment_results_topN.csv
  • tests/output/data/GSVA_list.rda
  • tests/output/plot/GSVA_heatmap.pdf
  • tests/output/session_info.txt
  • tests/output/output_manifest.txt
  • tests/output/run_record.txt
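
The expected outputs listed above can also be checked by hand. The sketch below is not the bundled tests/test_skill.R; it assumes a hypothetical helper named check_outputs and only tests that each file exists and is non-empty.

```shell
# Hypothetical helper: verify that every expected output file exists and is non-empty.
check_outputs() {
  out_dir="$1"
  missing=0
  for f in \
    table/GSVA_diff.csv \
    table/GSVA_enrichment_results.csv \
    table/GSVA_enrichment_results_topN.csv \
    data/GSVA_list.rda \
    plot/GSVA_heatmap.pdf \
    session_info.txt \
    output_manifest.txt \
    run_record.txt
  do
    if [ ! -s "$out_dir/$f" ]; then
      echo "missing or empty: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

`check_outputs tests/output` returns status 0 only when all eight files are present.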

Optional post-check:

Rscript tests/test_skill.R tests/output

tests/run_tests.R executes the full demo workflow, validates the expected output files, then reruns visualize in the same output_dir to confirm that output_manifest.txt and run_record.txt preserve both run sections.

References

  1. Hänzelmann S, Castelo R, Guinney J. (2013) GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. doi:10.1186/1471-2105-14-7
  2. Ritchie ME, Phipson B, Wu D, et al. (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. doi:10.1093/nar/gkv007
  3. Liberzon A, Birger C, Thorvaldsdóttir H, et al. (2015) The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems. doi:10.1016/j.cels.2015.12.004

For detailed algorithm notes, READ: references/algorithm.md

Implementation Checklist

  • CLI parsing with optparse
  • set.seed() for reproducibility
  • Only CRAN/Bioconductor packages
  • Documented parameters match script
  • get_script_dir() defined before any call to it
  • File reading instructions in SKILL.md
  • Test data provided in tests/data/
  • Error handling implemented with SKILL_* messages
  • Rscript scripts/main.R --help works