Documentation - Sig2Drug

Overview

Sig2Drug identifies potential therapeutic compounds by comparing your disease gene expression signature against an integrated database of drug-induced expression profiles (Tahoe-100M, SciPlex, PANACEA, MixSeq, DRUG-seq, DrugReflector, and others). The goal is to find drugs that can reverse the disease signature: upregulating genes that are downregulated in disease, and vice versa.

The quality of your results depends heavily on the quality of your input signature. This documentation will help you prepare optimal input data from various experimental sources.

Key Principle

The more accurately your input reflects true differential expression (not just fold change magnitude), the better your drug reversal results will be. Statistical significance matters more than effect size.

Quick Start

Choose your input type based on what data you have available:

You have...	Use this input	Quality
DESeq2/limma/edgeR results with statistics	Ranked Gene List	Best
Single-cell DE results (Wilcoxon, t-test, MAST)	Ranked Gene List	Best
Raw expression matrix (genes × samples)	Count Matrix	Convenient
Only gene lists (up/down regulated)	Up/Down Genes	Basic

Input Types Comparison

Ranked Gene List

Recommended

Provide pre-computed differential expression results as a ranked list of genes with associated statistics. This is the recommended option because you control the DE analysis pipeline and can use the most appropriate method for your data type.

Format Requirements

Two columns: gene symbol and statistic
Positive values = upregulated in disease
Negative values = downregulated in disease
Supports: TSV, CSV, semicolon-separated, or Excel (.xlsx)

gene,statistic
S100A8,8.92
S100A9,7.85
CXCL8,6.73
...
MUC2,-5.34
CDH1,-6.12
FABP1,-7.45
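A ranked list can be sanity-checked locally before upload. The sketch below is an illustration only, not Sig2Drug's own parser: the helper name `check_ranked_list` and the specific checks (two columns, numeric statistic, no duplicate genes, both signs present) are assumptions derived from the format requirements above.

```python
import csv
import io

def check_ranked_list(text, delimiter=","):
    """Parse 'gene,statistic' text and return (rows, problems).

    Hypothetical helper: the checks mirror the format requirements
    above, not Sig2Drug's actual validation logic.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader)                      # skip the header row
    rows, problems, seen = [], [], set()
    for line_no, fields in enumerate(reader, start=2):
        if not fields:                # ignore blank lines
            continue
        if len(fields) != 2:
            problems.append(f"line {line_no}: expected 2 columns")
            continue
        gene, raw = fields[0].strip(), fields[1].strip()
        try:
            value = float(raw)
        except ValueError:
            problems.append(f"line {line_no}: statistic is not numeric")
            continue
        if gene in seen:
            problems.append(f"line {line_no}: duplicate gene {gene}")
            continue
        seen.add(gene)
        rows.append((gene, value))
    # A reversal signature needs genes in both directions.
    if not any(v > 0 for _, v in rows) or not any(v < 0 for _, v in rows):
        problems.append("expected both positive and negative statistics")
    return rows, problems

example = "gene,statistic\nS100A8,8.92\nCXCL8,6.73\nMUC2,-5.34\nFABP1,-7.45\n"
rows, problems = check_ranked_list(example)
```

An empty `problems` list means the text passed these particular checks; it does not guarantee that the upload will be accepted.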

What is the "statistic" column?

The statistic is a numerical value that captures both the direction and confidence of differential expression for each gene. It tells Sig2Drug not just whether a gene is up or down, but how strongly and reliably it changed. The sign indicates direction (positive = upregulated in disease, negative = downregulated), and the magnitude indicates confidence.

Your pipeline	What to use as statistic	How to extract it
DESeq2	Wald statistic	`res$stat`
limma / limma-voom	Moderated t-statistic	`topTable(fit)$t`
edgeR	Signed −log10(p)	`sign(logFC) * -log10(PValue)`
Seurat (Wilcoxon)	Signed −log10(p)	`sign(avg_log2FC) * -log10(p_val)`
Seurat (t-test)	Signed −log10(p)	`sign(avg_log2FC) * -log10(p_val)` (FindMarkers does not return the t-statistic itself)
Scanpy	Signed −log10(p)	`sign(logfoldchanges) * -log10(pvals)`
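For pipelines that report only a fold change and a p-value, the signed −log10(p) in the table can be computed as in the following sketch. The function name, the example rows, and the p-value floor of 1e-300 (used to avoid taking the logarithm of zero) are illustrative assumptions.

```python
import math

def signed_neg_log10_p(log_fc, p_value, floor=1e-300):
    """Return sign(logFC) * -log10(p), flooring p to avoid log10(0)."""
    p = max(p_value, floor)
    magnitude = -math.log10(p)
    if log_fc > 0:
        return magnitude
    if log_fc < 0:
        return -magnitude
    return 0.0                        # no direction, no contribution

# Hypothetical DE rows: (gene, logFC, p-value)
de_rows = [("S100A8", 3.1, 1e-9), ("MUC2", -2.4, 1e-6), ("ACTB", 0.05, 0.8)]

# Sort from most upregulated to most downregulated.
ranked = sorted(
    ((gene, signed_neg_log10_p(fc, p)) for gene, fc, p in de_rows),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Genes with weak evidence (such as ACTB here) end up near zero, in the middle of the ranking, regardless of their fold change.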

Avoid using log2 fold change alone

Log2FC measures only magnitude, not reliability. A gene with log2FC = 5 but high variance is less informative than a gene with log2FC = 2 and low variance. The t-statistic or Wald statistic captures both effect size AND variance, producing more accurate drug matching. See the Statistics Recommendations section for details.

Count Matrix

Convenient

Upload your raw or normalized expression matrix, and Sig2Drug will perform differential expression analysis automatically. Convenient if you don't have access to R/Python DE pipelines, but for best results we recommend running your own DE analysis and submitting a Ranked Gene List.

Format Requirements

First column: Gene symbols (HGNC)
Remaining columns: Sample expression values
Column names of control samples must contain the word "control" (case-insensitive)
All other columns are treated as disease/condition samples
Supports: TSV, CSV, semicolon-separated, or Excel (.xlsx)

gene	Disease_1	Disease_2	Disease_3	Control_1	Control_2	Control_3
BRCA1	234.5	245.2	228.9	89.3	92.1	87.6
TP53	567.8	589.4	545.1	234.2	221.8	228.5
MYC	123.4	134.2	118.7	456.7	467.3	448.9
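The column-naming rule can be previewed locally to confirm how samples will be grouped. This is a minimal sketch that assumes a plain case-insensitive substring match on "control"; how Sig2Drug implements the rule internally is not documented here, and `split_columns` is a hypothetical helper.

```python
def split_columns(header):
    """Split sample column names into (control, disease) lists.

    Follows the rule stated above: a column is a control sample if its
    name contains 'control' (case-insensitive); everything else is
    treated as disease. The first column holds gene symbols.
    """
    samples = header[1:]
    control = [name for name in samples if "control" in name.lower()]
    disease = [name for name in samples if "control" not in name.lower()]
    return control, disease

header = ["gene", "Disease_1", "Disease_2", "Disease_3",
          "Control_1", "CONTROL_2", "control_3"]
control_cols, disease_cols = split_columns(header)
```

Under a substring match, a column named `Uncontrolled_1` would also be classified as a control, so unambiguous sample names are worth the effort.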

What happens internally

Sig2Drug automatically detects raw count data and applies TMM normalization with limma-voom for proper differential expression analysis. For pre-normalized data (e.g., microarray, log-scale), standard limma is used directly. The moderated t-statistic is used for ranking.

Up/Down Gene Lists

Basic

Provide two separate lists: upregulated and downregulated genes. This is the simplest input format but provides less information for drug matching since ranking information is lost.

Format Requirements

Two separate text boxes: UP genes and DOWN genes
One gene symbol per line
Minimum 25 genes in each list
Order within each list doesn't matter

↑ Upregulated

S100A8
S100A9
CXCL8
IL1B
MMP9
...

↓ Downregulated

MUC2
CDH1
FABP1
TFF3
CA2
...

Limitations

Only KS and XSum methods can be calculated with this input type. RGES requires continuous statistics. Consider using Ranked List or Count Matrix for better results.
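If you do have a ranked list but want to try the Up/Down input, the two lists can be derived from it. The sketch below is one way to do that; the cutoff `n=150` is an arbitrary illustrative value, while `minimum=25` reflects the stated requirement of at least 25 genes per list. `top_up_down` is a hypothetical helper name.

```python
def top_up_down(ranked, n=150, minimum=25):
    """Pick the n most positive and n most negative genes.

    `ranked` is a list of (gene, statistic) pairs. n=150 is an
    arbitrary illustrative choice; minimum=25 matches the stated
    requirement of at least 25 genes per list.
    """
    up = sorted((p for p in ranked if p[1] > 0), key=lambda p: -p[1])[:n]
    down = sorted((p for p in ranked if p[1] < 0), key=lambda p: p[1])[:n]
    up_genes = [g for g, _ in up]
    down_genes = [g for g, _ in down]
    if len(up_genes) < minimum or len(down_genes) < minimum:
        raise ValueError(f"need at least {minimum} genes in each list")
    return up_genes, down_genes

# Synthetic ranked list: 40 up genes (U1..U40) and 30 down genes (D1..D30)
ranked = [(f"U{i}", float(i)) for i in range(1, 41)]
ranked += [(f"D{i}", -float(i)) for i in range(1, 31)]
up_genes, down_genes = top_up_down(ranked, n=30)
```

Because a gene's sign decides which list it can enter, no gene appears in both lists.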

Statistics Recommendations

The choice of statistic significantly impacts the quality of drug reversal. Here's a comprehensive guide for different analysis pipelines:

Statistic	Source	Recommendation	Notes
t-statistic	limma, t-test	Best	Accounts for variance and sample size
Wald statistic	DESeq2	Best	Plays the same role as the t-statistic for RNA-seq counts
Moderated t	limma eBayes	Best	Borrows information across genes
Signed -log10(p)	Any DE tool	Good	sign(logFC) × -log10(p-value)
Wilcoxon W	Seurat, Scanpy	Good	Good for single-cell data
Z-score	edgeR, various	Good	Normalized test statistic
Log2 Fold Change	Any	Avoid	Doesn't account for variance or sample size

Why avoid log2 fold change?

Log2 fold change only measures the magnitude of change, not the reliability. A gene with log2FC = 5 but high variance is less reliable than a gene with log2FC = 2 and low variance. The t-statistic captures both effect size AND variance, giving you a more accurate ranking.
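The comparison can be made concrete with a small worked example. The sketch below uses made-up log2 expression values and a plain Welch t-statistic (not the moderated t from limma or the DESeq2 Wald statistic) to show how a smaller but consistent change outranks a larger but noisy one.

```python
import math
import statistics

def welch_t(disease, control):
    """Welch t-statistic: mean difference divided by its standard error."""
    diff = statistics.mean(disease) - statistics.mean(control)
    se = math.sqrt(statistics.variance(disease) / len(disease)
                   + statistics.variance(control) / len(control))
    return diff / se

# Made-up log2 expression values for two genes, three samples per group.
noisy_disease, noisy_control = [1.0, 9.0, 11.0], [0.0, 2.0, 4.0]   # log2FC = 5
stable_disease, stable_control = [7.9, 8.0, 8.1], [5.9, 6.0, 6.1]  # log2FC = 2

noisy_lfc = statistics.mean(noisy_disease) - statistics.mean(noisy_control)
stable_lfc = statistics.mean(stable_disease) - statistics.mean(stable_control)
noisy_t = welch_t(noisy_disease, noisy_control)
stable_t = welch_t(stable_disease, stable_control)
```

With these values the noisy gene (log2FC = 5) gets t ≈ 1.5, while the stable gene (log2FC = 2) gets t ≈ 24.5, so ranking by t places the stable gene far higher.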

Bulk RNA-seq

For bulk RNA-seq data, we recommend using established DE pipelines. Here are examples for the most common tools:

DESeq2 (Recommended)

library(DESeq2)

# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~ condition)
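dds$condition <- relevel(dds$condition, ref = "Control")  # assumes a level named "Control"
dds <- DESeq(dds)
res <- results(dds)  # statistics are disease vs. control

# Extract Wald statistic for Sig2Drug
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$stat  # Wald statistic
)
# Drop genes without a statistic (e.g., all-zero counts give NA)
sig2drug_input <- sig2drug_input[!is.na(sig2drug_input$statistic), ]
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]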

limma-voom

library(limma)
library(edgeR)

# Create DGEList and normalize
dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)

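# Design matrix (condition: factor of sample groups in column order,
# with the control group as the reference level so that coef = 2 is disease vs. control)
design <- model.matrix(~ condition)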

# voom transformation and fit
v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)

# Extract moderated t-statistic for Sig2Drug
res <- topTable(fit, coef = 2, number = Inf, sort.by = "none")
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$t  # Moderated t-statistic
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]

edgeR

library(edgeR)

# Create DGEList and estimate dispersions
dge <- DGEList(counts = counts, group = condition)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge)

# Exact test or GLM
et <- exactTest(dge)
res <- topTags(et, n = Inf)$table

# Create signed statistic for Sig2Drug
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = sign(res$logFC) * -log10(res$PValue)
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]

Single-Cell RNA-seq

Single-cell data requires special consideration because cells are not independent biological replicates - they come from the same samples. Using cell-level statistics can lead to inflated significance.

Pseudobulk is STRONGLY Preferred

Whenever you have multiple biological samples (e.g., multiple patients, multiple mice), aggregate cells into pseudobulk profiles per sample, then use bulk RNA-seq methods (DESeq2, limma). This accounts for true biological variation and avoids pseudoreplication.

Option 1: Pseudobulk + DESeq2 (Best)

When to use: You have cells from multiple biological samples (≥3 disease + ≥3 control samples).

library(Seurat)
library(DESeq2)

# Aggregate cells by sample
pseudobulk <- AggregateExpression(
  seurat_obj,
  group.by = "sample_id",  # Your sample/patient column
  slot = "counts"
)$RNA

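# Create metadata (one row per sample), in the same order as the pseudobulk columns
sample_meta <- seurat_obj@meta.data[!duplicated(seurat_obj$sample_id),
                                    c("sample_id", "condition")]
rownames(sample_meta) <- sample_meta$sample_id
# Note: some Seurat versions alter column names (e.g., "_" becomes "-");
# make sure colnames(pseudobulk) match sample_id before indexing
sample_meta <- sample_meta[colnames(pseudobulk), ]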

# Run DESeq2 on pseudobulk
dds <- DESeqDataSetFromMatrix(
  countData = pseudobulk,
  colData = sample_meta,
  design = ~ condition
)
dds <- DESeq(dds)
res <- results(dds)

# Extract Wald statistic for Sig2Drug
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$stat
)
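# Drop genes without a statistic (e.g., all-zero counts give NA)
sig2drug_input <- sig2drug_input[!is.na(sig2drug_input$statistic), ]
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]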

Option 2: Pseudobulk + limma-voom

When to use: Same as above. Alternative to DESeq2, especially with larger sample sizes.

library(Seurat)
library(limma)
library(edgeR)

# Aggregate cells by sample
pseudobulk <- AggregateExpression(
  seurat_obj,
  group.by = "sample_id",
  slot = "counts"
)$RNA

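# Create metadata (one row per sample), in the same order as the pseudobulk columns
sample_meta <- seurat_obj@meta.data[!duplicated(seurat_obj$sample_id),
                                    c("sample_id", "condition")]
rownames(sample_meta) <- sample_meta$sample_id
# Note: some Seurat versions alter column names (e.g., "_" becomes "-");
# make sure colnames(pseudobulk) match sample_id before indexing
sample_meta <- sample_meta[colnames(pseudobulk), ]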

# TMM normalization + voom
dge <- DGEList(counts = pseudobulk)
dge <- calcNormFactors(dge)
design <- model.matrix(~ condition, data = sample_meta)
v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)

# Extract moderated t-statistic for Sig2Drug
res <- topTable(fit, coef = 2, number = Inf, sort.by = "none")
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$t
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]
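For readers working outside Seurat, the aggregation step used in Options 1 and 2 amounts to summing raw counts over all cells of each sample. The sketch below shows that idea with plain Python dictionaries; the data layout and the name `pseudobulk` are assumptions made for illustration, and a real dataset would use a sparse matrix rather than nested dictionaries.

```python
from collections import defaultdict

def pseudobulk(cell_counts, cell_to_sample):
    """Sum raw counts over all cells belonging to the same sample.

    cell_counts: {cell_id: {gene: count}}
    cell_to_sample: {cell_id: sample_id}
    Returns {sample_id: {gene: summed_count}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cell_id, counts in cell_counts.items():
        sample = cell_to_sample[cell_id]
        for gene, count in counts.items():
            totals[sample][gene] += count
    return {s: dict(g) for s, g in totals.items()}

# Toy data: three cells from two samples
cell_counts = {
    "cell1": {"S100A8": 3, "MUC2": 0},
    "cell2": {"S100A8": 5, "MUC2": 1},
    "cell3": {"S100A8": 0, "MUC2": 4},
}
cell_to_sample = {"cell1": "patient_A", "cell2": "patient_A",
                  "cell3": "control_B"}
bulk = pseudobulk(cell_counts, cell_to_sample)
```

The result has one profile per biological sample, which is the unit of replication that DESeq2 and limma-voom expect.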

What if I only have ONE sample per condition?

Without biological replicates, you cannot estimate true biological variance. However, you can still generate a useful signature using cell-level tests. Just be aware that:

P-values will be artificially small (many cells = artificial power)
Results reflect technical/cellular variation, not biological
Use results as exploratory, not definitive

Option 3: Cell-Level Analysis (Single Sample)

When to use: You only have one sample per condition (e.g., one patient vs one control, or one treated vs one untreated sample). Use with caution.

library(Seurat)

# Cell-level DE (use when no biological replicates)
markers <- FindMarkers(
  seurat_obj,
  ident.1 = "Disease",
  ident.2 = "Control",
  test.use = "wilcox",  # Wilcoxon is robust
  logfc.threshold = 0,  # Keep all genes
  min.pct = 0.1         # Filter lowly expressed
)

# Create signed statistic
# Use sign(logFC) x -log10(p) to combine direction and significance
sig2drug_input <- data.frame(
  gene = rownames(markers),
  statistic = sign(markers$avg_log2FC) * -log10(markers$p_val + 1e-300)
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]

# The same signed statistic works for test.use = "t": FindMarkers returns
# p-values and avg_log2FC, not the t-statistic itself.

Scanpy (Python)

import scanpy as sc
import numpy as np

# Differential expression
sc.tl.rank_genes_groups(adata, groupby='condition', 
                         method='wilcoxon',  # or 't-test'
                         reference='Control')

# Extract results
result = sc.get.rank_genes_groups_df(adata, group='Disease')

# Create signed statistic for Sig2Drug
result['statistic'] = np.sign(result['logfoldchanges']) * -np.log10(result['pvals'] + 1e-300)
sig2drug_input = result[['names', 'statistic']].sort_values('statistic', ascending=False)

Single-Cell Method Comparison

Scenario	Recommended Approach	Statistic	Quality
Multiple samples per group (≥3)	Pseudobulk + DESeq2/limma-voom	Wald / t-statistic	Best
2 samples per group	Pseudobulk + limma-voom (careful)	Moderated t	OK
1 sample per group	Cell-level Wilcoxon/t-test	sign(logFC) × -log10(p)	Exploratory
MAST (any scenario)	Cell-level with covariates	sign(logFC) × -log10(p)	Good

Why not just use logFC from single-cell?

Log2 fold change alone doesn't account for variance. With thousands of cells, even tiny differences become "significant." The signed -log10(p) approach at least incorporates statistical testing, though significance remains inflated without true replicates.

Microarray

For microarray data (Affymetrix, Illumina, etc.), limma is the gold standard for differential expression analysis. Microarray data is typically already normalized (log2-scale), so no additional normalization (TMM, voom) is needed:

library(limma)
library(GEOquery)  # If downloading from GEO

# Example: Download and process GEO dataset
gse <- getGEO("GSE12345")[[1]]
expr <- exprs(gse)  # Expression matrix (already log2-normalized)

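# Design matrix (example layout: 5 controls followed by 5 disease samples;
# replace with the group labels of your own samples, in column order)
condition <- factor(c(rep("Control", 5), rep("Disease", 5)))
design <- model.matrix(~ condition)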

# Fit model (no voom needed - data is already continuous/normalized)
fit <- lmFit(expr, design)
fit <- eBayes(fit)

# Get results
res <- topTable(fit, coef = 2, number = Inf)

# Map probe IDs to gene symbols (platform-specific) and store them in
# res$gene_symbol; topTable() does not create this column for you.
# Then extract the t-statistic for Sig2Drug
sig2drug_input <- data.frame(
  gene = res$gene_symbol,
  statistic = res$t  # Moderated t-statistic
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]

Probe-to-gene mapping

If multiple probes map to the same gene, keep the probe with the highest absolute t-statistic, or average the values. Most annotation packages (hgu133plus2.db, illuminaHumanv4.db, etc.) provide probe-to-symbol mappings.
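The "keep the probe with the highest absolute t-statistic" rule can be sketched as follows. The probe IDs, the tuple layout, and the helper name `collapse_probes` are hypothetical; dropping probes without a gene symbol is an assumption.

```python
def collapse_probes(probe_rows):
    """Keep, for each gene, the probe with the largest |t|.

    probe_rows: iterable of (probe_id, gene_symbol, t_statistic).
    Probes without a gene symbol are dropped.
    """
    best = {}
    for probe_id, gene, t_stat in probe_rows:
        if not gene:
            continue
        if gene not in best or abs(t_stat) > abs(best[gene]):
            best[gene] = t_stat
    # Return (gene, statistic) pairs, most upregulated first.
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical probe-level results
probe_rows = [
    ("probe_1", "TP53", 2.1),
    ("probe_2", "TP53", -4.8),
    ("probe_3", "MYC", 3.3),
    ("probe_4", "", 9.9),
]
collapsed = collapse_probes(probe_rows)
```

Here TP53 keeps the value -4.8 because that probe has the larger absolute t-statistic, even though the other probe points in the opposite direction.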

Scoring Methods

Sig2Drug offers multiple scoring methods to evaluate drug-disease signature matching. Each method captures different aspects of signature reversal:

Recommended: XSum + RGES (Combo)

The default "combo" option runs both XSum and RGES, providing complementary views of signature reversal. Use this unless you have a specific reason to choose a single method.

eXtreme Methods (X-methods)

These methods focus on the most differentially expressed genes (top up and top down) rather than using all genes. This makes them robust to noise from genes that change little between disease and control.

XSum - eXtreme Sum Score

Measures the difference in mean z-scores between disease-upregulated and disease-downregulated genes in the drug signature.

Formula: XSum = mean(z_up) − mean(z_down)

Interpretation: Lower (more negative) = better reversal. A drug that downregulates disease-UP genes and upregulates disease-DOWN genes will have a negative XSum.
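The formula can be illustrated with a few hypothetical drug z-scores. This is a minimal sketch of the stated formula only; how Sig2Drug selects the extreme genes and handles genes missing from a drug profile is not specified here, so skipping missing genes is an assumption.

```python
from statistics import mean

def xsum(drug_z, up_genes, down_genes):
    """XSum = mean(z_up) - mean(z_down), using genes present in the drug profile."""
    z_up = [drug_z[g] for g in up_genes if g in drug_z]
    z_down = [drug_z[g] for g in down_genes if g in drug_z]
    if not z_up or not z_down:
        raise ValueError("no overlap between disease genes and drug profile")
    return mean(z_up) - mean(z_down)

# Hypothetical drug z-scores: the drug lowers disease-UP genes
# and raises disease-DOWN genes.
drug_z = {"S100A8": -2.0, "CXCL8": -1.0, "MUC2": 1.5, "FABP1": 0.5}
score = xsum(drug_z, up_genes=["S100A8", "CXCL8"], down_genes=["MUC2", "FABP1"])
```

In this example mean(z_up) = -1.5 and mean(z_down) = 1.0, giving XSum = -2.5, a strongly negative score consistent with reversal.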

XCos - eXtreme Cosine Similarity

Computes the cosine similarity between the disease signature and drug signature for extreme genes.

Formula: XCos = cos(θ) between disease and drug vectors

Interpretation: Ranges from -1 (perfect reversal) to +1 (same direction). Values close to -1 indicate strong reversal potential.
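A cosine similarity restricted to extreme genes might look like the sketch below. The parameter `n_extreme` and the way extreme genes are chosen (the most negative and most positive disease values among shared genes) are illustrative assumptions, not documented Sig2Drug settings.

```python
import math

def xcos(disease, drug, n_extreme=2):
    """Cosine similarity over the n most up and n most down disease genes.

    disease, drug: {gene: value}. Only genes present in both are used.
    n_extreme is an illustrative parameter, not a documented default.
    """
    shared = sorted((g for g in disease if g in drug), key=lambda g: disease[g])
    # dict.fromkeys removes duplicates when the two ends overlap.
    extreme = list(dict.fromkeys(shared[:n_extreme] + shared[-n_extreme:]))
    a = [disease[g] for g in extreme]
    b = [drug[g] for g in extreme]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm

disease = {"S100A8": 8.0, "CXCL8": 6.0, "ACTB": 0.1, "MUC2": -5.0, "FABP1": -7.0}
mirror_drug = {g: -v for g, v in disease.items()}   # exact opposite profile
score = xcos(disease, mirror_drug)
```

A drug profile that is the exact mirror image of the disease signature scores -1 (up to floating-point rounding), the "perfect reversal" end of the range.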

XCor - eXtreme Pearson Correlation

Pearson correlation coefficient between disease and drug expression changes for extreme genes.

Formula: XCor = Pearson(disease_signature, drug_signature)

Interpretation: Ranges from -1 to +1. Negative correlation indicates reversal. More sensitive to linear relationships.

XSpe - eXtreme Spearman Correlation

Spearman rank correlation between disease and drug signatures for extreme genes.

Formula: XSpe = Spearman(disease_signature, drug_signature)

Interpretation: Ranges from -1 to +1. More robust to outliers than Pearson. Captures monotonic relationships.
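The difference between XCor and XSpe can be seen on a small example. The sketch below implements Pearson correlation directly and Spearman as Pearson on average ranks; the four values are made up, and selection of the extreme genes is assumed to have happened already.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical values for four extreme genes
disease = [8.0, 6.0, -5.0, -7.0]
drug = [-30.0, -1.0, 0.5, 1.0]   # reversed direction, one outlying value
xcor = pearson(disease, drug)
xspe = spearman(disease, drug)
```

The drug values are in exactly the reverse order of the disease values, so the rank-based score is -1, while the single outlying value (-30) pulls the Pearson score to roughly -0.7.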

Other Methods

KS - Kolmogorov-Smirnov Score

Non-parametric enrichment score based on the original Connectivity Map (CMap) methodology. Tests whether up/down gene sets are enriched at opposite ends of the drug-ranked gene list.

Key feature: Needs only Up/Down gene lists; no statistics are required. Useful when you only have gene lists without quantitative values.

Interpretation: Negative scores indicate reversal. Based on enrichment statistics rather than direct correlation.
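For orientation, the KS-based connectivity score from the original Connectivity Map approach can be sketched as below. This follows the classic formulation (an enrichment score per gene set, combined as ks_up − ks_down and set to zero when both sets are enriched at the same end); whether Sig2Drug uses exactly this variant is an assumption.

```python
def ks_enrichment(gene_set, ranked_genes):
    """CMap-style KS enrichment of gene_set within ranked_genes.

    ranked_genes: drug profile ordered from most up- to most down-regulated.
    Positive result: the set sits near the top; negative: near the bottom.
    """
    n = len(ranked_genes)
    position = {gene: i + 1 for i, gene in enumerate(ranked_genes)}
    hits = sorted(position[g] for g in gene_set if g in position)
    t = len(hits)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - v / n for j, v in enumerate(hits))
    b = max(v / n - j / t for j, v in enumerate(hits))
    return a if a > b else -b

def connectivity_score(up_genes, down_genes, ranked_genes):
    """ks_up - ks_down, or 0 when both sets are enriched at the same end."""
    ks_up = ks_enrichment(up_genes, ranked_genes)
    ks_down = ks_enrichment(down_genes, ranked_genes)
    if ks_up * ks_down > 0:
        return 0.0
    return ks_up - ks_down

# Toy drug profile of 10 genes; disease-UP genes sit at the bottom (the drug
# lowers them) and disease-DOWN genes at the top (the drug raises them).
ranked_genes = ["D1", "D2", "G3", "G4", "G5", "G6", "G7", "G8", "U1", "U2"]
score = connectivity_score(["U1", "U2"], ["D1", "D2"], ranked_genes)
```

In this toy profile ks_up = -0.9 and ks_down = 0.8, so the score is -1.7, a negative value in line with the reversal interpretation above.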

RGES - Reverse Gene Expression Score

A weighted correlation-based score that measures overall signature reversal across all matched genes, with weights based on gene importance.

Key feature: Uses all matched genes, not just extremes. Provides a more global view of reversal.

Interpretation: Ranges from -1 (perfect reversal) to +1 (same direction). RGES < -0.3 suggests meaningful reversal potential.

Method Comparison

Method	Uses	Best for	Requires statistics?
XSum	Extreme genes	General use, robust	No (up/down lists OK; ranked list or matrix preferred)
XCos	Extreme genes	Angular similarity	Yes
XCor	Extreme genes	Linear relationships	Yes
XSpe	Extreme genes	Robust to outliers	Yes
KS	Gene sets	Only have gene lists	No (up/down lists OK)
RGES	All genes	Global reversal view	Yes

Interpreting Results

Lower scores indicate better drug candidates for reversing the disease signature:

XSum Score	Interpretation	Action
< -1.5	Excellent reversal	Strong candidate for validation
-1.5 to -0.8	Good reversal	Worth investigating
-0.8 to -0.3	Moderate reversal	May have partial effect
> -0.3	Weak/no reversal	Unlikely to reverse signature
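When post-processing many results, the table can be applied programmatically. The sketch below encodes the thresholds as written; the table does not say which category the exact boundary values (-1.5, -0.8, -0.3) belong to, so assigning them to the weaker category is an assumption, as is the helper name `interpret_xsum`.

```python
def interpret_xsum(score):
    """Map an XSum score to the categories in the table above.

    How exact boundary values (-1.5, -0.8, -0.3) are assigned is an
    assumption; the table does not specify it.
    """
    if score < -1.5:
        return "Excellent reversal"
    if score < -0.8:
        return "Good reversal"
    if score < -0.3:
        return "Moderate reversal"
    return "Weak/no reversal"

labels = [interpret_xsum(s) for s in (-2.0, -1.0, -0.5, 0.2)]
```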

Validation is essential

Sig2Drug generates computational hypotheses that require experimental validation. Consider the drug's known mechanism of action, safety profile, and relevance to your disease context before proceeding to validation studies.