GSVA Analysis And Visualization

When to Use

Use this skill when the user wants one of the following:

Pathway-level GSVA or ssGSEA analysis from a bulk expression matrix plus a sample group file
Case-vs-control or treatment-vs-control pathway enrichment comparison using GSVA plus limma
KEGG or MSigDB pathway analysis for bulk RNA-seq or microarray-like expression data
Heatmap generation from an existing data/GSVA_list.rda result object
A reproducible CLI-backed GSVA workflow with saved tables, an .rda object, and a PDF heatmap

Typical request patterns:

"Run GSVA on my bulk RNA-seq matrix and compare case vs control"
"Use ssGSEA to score pathways and save the pathway differential results"
"Generate a KEGG pathway heatmap from my saved GSVA result"
"Do pathway enrichment with GSVA for these grouped bulk samples"

Execution Model

This is a hybrid skill.

Use SKILL.md to verify that the request is in scope.
Use scripts/main.R for real execution.
Use --mode analyze to compute pathway scores and differential results.
Use --mode visualize to reuse an existing data/GSVA_list.rda and generate a heatmap. In visualize mode, GSVA_list.rda must exist in output_dir/data/; run analyze or full mode first if it is missing (SKILL_FILE_NOT_FOUND will be raised otherwise).
Use --mode full to run analysis and visualization in one pass.
Read reference files only when you need algorithm details, troubleshooting, or additional CLI examples.

When to Read External Files

Situation	File to Read	Purpose
Need algorithm details	`references/algorithm.md`	Understand GSVA, limma, and heatmap generation logic
Need to run analysis or plotting	`scripts/main.R`	Execute the CLI entry point
Encounter errors	`references/troubleshooting.md`	Find standard error codes and fixes
Need more CLI examples or the baseline execution record	`references/cli-guide.md`	Copy ready-to-run commands and review the recorded test run
Need sample input files	`tests/data/`	Use the bundled demo matrix and group file

When Not to Use

Gene-level differential expression: use differential-expression-analysis instead
Single-cell RNA-seq clustering or communication analysis: use sc-clustering or cellchat
Immune infiltration scoring rather than pathway enrichment: use ssgsea-r or ssgsea_immune
Clinical diagnosis, treatment selection, or patient-specific interpretation: do not use this skill; ask for a validated clinical workflow or human expert review

If the request falls outside these boundaries, stop and tell the user that this skill only covers bulk expression pathway-level GSVA/ssGSEA analysis plus downstream heatmap visualization.

Method Selection Guide

Choose --method based on your data characteristics:

gsva: kernel-based enrichment scores suitable for continuous expression data with moderate-to-large sample sizes (≥ 10 samples per group recommended).
ssgsea: rank-based enrichment scores; less sensitive to outliers and more suitable for noisy data or smaller sample sizes.

For detailed methodological comparison, READ: references/algorithm.md

Usage

Rscript scripts/main.R \
  --mode full \
  --input_file tests/data/expr_matrix.csv \
  --group_file tests/data/group.csv \
  --case_group Tumor \
  --control_group Healthy \
  --species "Homo sapiens" \
  --category C2 \
  --subcategory KEGG \
  --output_dir ./output \
  --seed 42

Arguments

Short	Long	Type	Default	Description
`-m`	`--mode`	character	`analyze`	Run mode: `analyze`, `visualize`, or `full`
`-i`	`--input_file`	character	required for `analyze`/`full`	Expression matrix file (CSV or TSV, genes as rows, samples as columns)
`-g`	`--group_file`	character	required for `analyze`/`full`	Sample group file (CSV or TSV with sample and group columns)
`-a`	`--case_group`	character	required for `analyze`/`full`	Case or treatment group label
`-c`	`--control_group`	character	required for `analyze`/`full`	Control group label
`-o`	`--output_dir`	character	`./output/`	Output directory
`-s`	`--species`	character	`Homo sapiens`	MSigDB species
`-C`	`--category`	character	`C2`	MSigDB category
`-S`	`--subcategory`	character	`KEGG`	MSigDB subcategory
	`--method`	character	`gsva`	GSVA method: `gsva` or `ssgsea` (see Method Selection Guide above)
	`--kcdf`	character	`Gaussian`	GSVA kernel: `Gaussian`, `Poisson`, or `none`
	`--min_sz`	integer	`2`	Minimum gene set size
	`--max_sz`	integer	`10000`	Maximum gene set size
	`--parallel_sz`	integer	`1`	Parallel worker count passed to GSVA
	`--mx_diff`	logical	`TRUE`	GSVA `mx.diff` flag
	`--tau`	double	`1`	GSVA `tau` value
	`--fdr_threshold`	double	`0.05`	FDR threshold used to select top pathways
	`--top_n`	integer	`20`	Number of pathways exported to the top score matrix
	`--seed`	integer	`42`	Random seed
	`--timeout_seconds`	integer	`0`	Optional timeout in seconds; `0` disables it
	`--plot_file`	character	`GSVA_heatmap.pdf`	Heatmap file name under `plot/` (file name only; no path separators)
	`--plot_title`	character	`GSVA Enrichment Heatmap`	Heatmap title
	`--width`	double	`14`	Heatmap width in inches
	`--height`	double	`8`	Heatmap height in inches
	`--colors`	character	`#91bfdb,#ffffbf,#fc8d59`	Comma-separated heatmap colors
	`--scale`	character	`none`	Heatmap scale mode: `none`, `row`, or `column`
	`--cluster_rows`	logical	`TRUE`	Cluster heatmap rows
	`--cluster_cols`	logical	`FALSE`	Cluster heatmap columns
	`--show_rownames`	logical	`TRUE`	Show pathway names on the heatmap
	`--show_colnames`	logical	`FALSE`	Show sample names on the heatmap
	`--fontsize`	double	`10`	Base heatmap font size
	`--fontsize_row`	double	`8`	Row label font size
	`--fontsize_col`	double	`9`	Column label font size
	`--legend_cex`	double	`1`	Legend text scaling factor
	`--top_up`	integer	optional	Number of up-regulated pathways retained for plotting
	`--top_down`	integer	optional	Number of down-regulated pathways retained for plotting
	`--top_mode`	character	`both`	Heatmap subset mode: `both`, `up`, `down`, or `total`
	`--sort_by`	character	`FDR`	Pathway ranking: `FDR`, `absLFC`, or `LFC`
	`--append_stats`	logical	`FALSE`	Append `FDR` and `logFC` to heatmap labels
	`--label_max_chars`	integer	`80`	Maximum heatmap label length

Input Format

Expression Matrix

CSV or TSV file
First column contains gene identifiers
Remaining columns are sample names
Values must be numeric and contain no missing values

Example:

gene,S1,S2,S3,S4
TP53,8.1,7.9,6.5,6.3
EGFR,5.2,5.0,4.2,4.1

The bundled tests/data/expr_matrix.csv is derived from the public GEO series GSE44076 after probe-to-gene collapsing and contains the Tumor versus Healthy subset.

Group File

CSV or TSV file with a header row
One sample column: sample, sample_name, or sample_id
One group column: group, condition, cluster, or class
Sample names must match the expression matrix columns

Example:

sample,group
GSM1077746,Tumor
GSM1077747,Tumor
GSM1077598,Healthy
GSM1077599,Healthy

Output Files

File	Description
`table/GSVA_diff.csv`	limma differential pathway results with `logFC`, `P.Value`, and `adj.P.Val`
`table/GSVA_enrichment_results.csv`	Full GSVA score matrix
`table/GSVA_enrichment_results_topN.csv`	Top pathway score matrix selected by `--top_n` and `--fdr_threshold`
`data/GSVA_list.rda`	Saved `gsva_result` object for downstream visualization
`plot/GSVA_heatmap.pdf`	Heatmap PDF generated in `visualize` or `full` mode
`session_info.txt`	R session and package version information
`output_manifest.txt`	Append-only manifest of generated outputs across runs in the same `output_dir`
`run_record.txt`	Append-only run log with parameters, runtime, and output summaries across runs in the same `output_dir`

table/GSVA_diff.csv

Column	Type	Description
`logFC`	numeric	limma-estimated pathway score difference between case and control
`AveExpr`	numeric	Average pathway score across all samples
`t`	numeric	Moderated t statistic from limma
`P.Value`	numeric	Raw p-value from limma
`adj.P.Val`	numeric	Benjamini-Hochberg adjusted p-value
`B`	numeric	Log-odds that the pathway is differentially enriched
`geneset`	character	Pathway identifier used in the GSVA run

Workflow

Step 1: Validate Input

Check that the expression matrix and group file exist
Validate supported columns and matching sample names
Validate CLI ranges and mode-specific required parameters

Step 2: Run Pathway Analysis

Load MSigDB gene sets for the requested species and collection
Compute GSVA or ssGSEA scores for each sample
Fit a limma model for the case-vs-control pathway comparison

Step 3: Generate Output

Save the full score matrix, top pathway subset, and differential results to table/
Save the reusable gsva_result object to data/GSVA_list.rda
Generate the heatmap PDF in plot/ when running visualize or full
Append a new section to output_manifest.txt and run_record.txt for each invocation so earlier provenance is preserved when reusing one output_dir

Examples

Basic Usage

Rscript scripts/main.R \
  --mode full \
  --input_file ./expression_matrix.csv \
  --group_file ./group_info.csv \
  --case_group treatment \
  --control_group control \
  --output_dir ./output

With ssGSEA and Custom Parameters

Rscript scripts/main.R \
  --mode analyze \
  --input_file ./expression_matrix.csv \
  --group_file ./group_info.csv \
  --case_group treatment \
  --control_group control \
  --method ssgsea \
  --top_n 30 \
  --fdr_threshold 0.1 \
  --output_dir ./ssgsea_output \
  --seed 123

Reuse a Saved Result Object

Rscript scripts/main.R \
  --mode visualize \
  --output_dir ./output \
  --plot_file custom_heatmap.pdf \
  --top_up 10 \
  --top_down 10 \
  --top_mode both

For the bundled real-data baseline record, READ: references/cli-guide.md

Error Handling

Error Code	Meaning	Solution
`SKILL_FILE_NOT_FOUND`	Input file or saved result file is missing; in visualize mode, `GSVA_list.rda` must exist in `output_dir/data/` — run analyze or full mode first	Check the path and rerun with the correct file
`SKILL_MISSING_COLUMNS`	Group file lacks a valid sample or group column	Rename the columns to a supported name
`SKILL_SAMPLE_MISMATCH`	Sample names do not match between files	Align sample names before running the skill
`SKILL_EMPTY_DATA`	Input matrix, gene set query, or plotting matrix is empty	Verify the input matrix and MSigDB settings
`SKILL_INVALID_PARAMETER`	A CLI argument is missing or out of range	Review the parameter table and rerun
`SKILL_PACKAGE_NOT_FOUND`	Required R packages are not installed	Install the missing packages listed in `references/cli-guide.md`

If the error persists, READ: references/troubleshooting.md

Input Validation

This skill accepts:

A bulk expression matrix file in CSV or TSV format with genes as rows and samples as columns
A sample group file with one supported sample column and one supported group column
A valid case/control comparison for pathway-level GSVA or ssGSEA analysis
Optional heatmap customization parameters for visualization of a saved GSVA_list.rda

Privacy and data-handling note:

If your matrix or group file can be linked to patients or protected records, anonymize it before use
This workflow writes result tables, a saved R object, plots, and session metadata to the local output_dir
Review local output retention practices before using sensitive material

If the user's request does not involve bulk expression pathway enrichment analysis or GSVA heatmap generation — for example, asking for single-cell analysis, gene-level DE testing, methylation analysis, or clinical diagnosis — do not proceed with this workflow. Instead respond:

"gsva-analysis-and-visualization is designed for bulk expression pathway-level GSVA/ssGSEA analysis and saved-result heatmap visualization. Your request appears to be outside this scope. Please provide a bulk expression matrix plus sample group file for GSVA/ssGSEA analysis, or use a more appropriate skill for your task."

Testing

Rscript scripts/main.R --help

Rscript tests/run_tests.R

Rscript scripts/main.R \
  --mode full \
  --input_file tests/data/expr_matrix.csv \
  --group_file tests/data/group.csv \
  --case_group Tumor \
  --control_group Healthy \
  --species "Homo sapiens" \
  --category C2 \
  --subcategory KEGG \
  --output_dir tests/output \
  --seed 42

Expected outputs:

tests/output/table/GSVA_diff.csv
tests/output/table/GSVA_enrichment_results.csv
tests/output/table/GSVA_enrichment_results_topN.csv
tests/output/data/GSVA_list.rda
tests/output/plot/GSVA_heatmap.pdf
tests/output/session_info.txt
tests/output/output_manifest.txt
tests/output/run_record.txt

Optional post-check:

Rscript tests/test_skill.R tests/output

tests/run_tests.R executes the full demo workflow, validates the expected output files, then reruns visualize in the same output_dir to confirm that output_manifest.txt and run_record.txt preserve both run sections.

References

Hanzelmann S, Castelo R, Guinney J. (2013) GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. doi:10.1186/1471-2105-14-7
Ritchie ME, Phipson B, Wu D, et al. (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. doi:10.1093/nar/gkv007
Liberzon A, Birger C, Thorvaldsdottir H, et al. (2015) The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems. doi:10.1016/j.cels.2015.12.004

For detailed algorithm notes, READ: references/algorithm.md

Implementation Checklist

CLI parsing with optparse
set.seed() for reproducibility
Only CRAN/Bioconductor packages
Documented parameters match script
get_script_dir() defined before any call to it
File reading instructions in SKILL.md
Test data provided in tests/data/
Error handling implemented with SKILL_* messages
Rscript scripts/main.R --help works

Gsva Analysis And Visualization