Technical Documentation

Learn how to prepare your gene expression data for drug repurposing analysis. This guide covers all supported input formats, recommended statistics, and best practices for different data types.

Overview

Sig2Drug identifies potential therapeutic compounds by comparing your disease gene expression signature against an integrated database of drug-induced expression profiles (Tahoe-100M, SciPlex, PANACEA, MixSeq, DRUG-seq, DrugReflector, and others). The goal is to find drugs that reverse the disease signature: upregulating genes that are downregulated in disease, and vice versa.

The quality of your results depends heavily on the quality of your input signature. This documentation will help you prepare optimal input data from various experimental sources.

Key Principle
The more accurately your input reflects true differential expression (not just fold change magnitude), the better your drug reversal results will be. Statistical significance matters more than effect size.

Quick Start

Choose your input type based on what data you have available:

You have... | Use this input | Quality
DESeq2/limma/edgeR results with statistics | Ranked Gene List | Best
Single-cell DE results (Wilcoxon, t-test, MAST) | Ranked Gene List | Best
Raw expression matrix (genes × samples) | Count Matrix | Convenient
Only gene lists (up/down regulated) | Up/Down Genes | Basic

Input Types Comparison

Ranked Gene List
Recommended

Provide pre-computed differential expression results as a ranked list of genes with associated statistics. This is the recommended option because you control the DE analysis pipeline and can use the most appropriate method for your data type.

Format Requirements

  • Two columns: gene symbol and statistic
  • Positive values = upregulated in disease
  • Negative values = downregulated in disease
  • Supports: TSV, CSV, semicolon-separated, or Excel (.xlsx)
gene,statistic
S100A8,8.92
S100A9,7.85
CXCL8,6.73
...
MUC2,-5.34
CDH1,-6.12
FABP1,-7.45

What is the "statistic" column?

The statistic is a numerical value that captures both the direction and confidence of differential expression for each gene. It tells Sig2Drug not just whether a gene is up or down, but how strongly and reliably it changed. The sign indicates direction (positive = upregulated in disease, negative = downregulated), and the magnitude indicates confidence.
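
As a minimal sketch of how direction and confidence combine into one number, the signed −log10(p) form can be computed as below. The rows are made-up values for illustration, and the `floor` parameter is an assumption added here to guard against p-values reported as exactly 0.

```python
import math

def signed_log10_p(log_fc, p_value, floor=1e-300):
    """Return sign(log_fc) * -log10(p_value), flooring p to avoid log10(0)."""
    p = max(p_value, floor)
    sign = (log_fc > 0) - (log_fc < 0)  # +1, -1, or 0
    return sign * -math.log10(p)

# Hypothetical DE rows: (gene, log2 fold change, p-value)
rows = [("S100A8", 3.1, 1e-9), ("MUC2", -2.4, 1e-6), ("GAPDH", 0.05, 0.8)]

# Rank from most upregulated to most downregulated in disease
ranked = sorted(
    ((gene, signed_log10_p(lfc, p)) for gene, lfc, p in rows),
    key=lambda pair: pair[1],
    reverse=True,
)
for gene, stat in ranked:
    print(f"{gene}\t{stat:.2f}")
```

Genes with weak evidence (here GAPDH) end up near zero regardless of direction, which is the behavior the ranking relies on.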

Your pipeline | What to use as statistic | How to extract it
DESeq2 | Wald statistic | res$stat
limma / limma-voom | Moderated t-statistic | topTable(fit)$t
edgeR | Signed −log10(p) | sign(logFC) * -log10(PValue)
Seurat (Wilcoxon) | Signed −log10(p) | sign(avg_log2FC) * -log10(p_val)
Seurat (t-test) | Signed −log10(p) | sign(avg_log2FC) * -log10(p_val) (FindMarkers does not return the t-statistic itself)
Scanpy | Signed −log10(p) | sign(logfoldchanges) * -log10(pvals)

Avoid using log2 fold change alone
Log2FC measures only magnitude, not reliability. A gene with log2FC = 5 but high variance is less informative than a gene with log2FC = 2 and low variance. The t-statistic or Wald statistic captures both effect size AND variance, producing more accurate drug matching. See the Statistics Recommendations section for details.
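
The point can be illustrated with a small sketch using made-up log2 expression values: one gene with a large but erratic shift, another with a smaller but consistent one. Welch's t-statistic is used here as a simple stand-in for the moderated t or Wald statistic a real pipeline would report; the sample values are assumptions chosen to mirror the log2FC = 5 versus log2FC = 2 example.

```python
import math
import statistics

def welch_t(disease, control):
    """Welch's t-statistic: difference in means divided by its standard error."""
    se = math.sqrt(
        statistics.variance(disease) / len(disease)
        + statistics.variance(control) / len(control)
    )
    return (statistics.mean(disease) - statistics.mean(control)) / se

# Hypothetical log2 expression values, four samples per group
noisy_d, noisy_c = [12.0, 3.0, 13.0, 4.0], [2.0, 4.0, 3.0, 3.0]   # large, erratic shift
steady_d, steady_c = [7.1, 6.9, 7.0, 7.0], [5.0, 5.1, 4.9, 5.0]   # smaller, consistent shift

for name, d, c in [("noisy", noisy_d, noisy_c), ("steady", steady_d, steady_c)]:
    log2fc = statistics.mean(d) - statistics.mean(c)  # difference of log2 means
    print(f"{name}: log2FC={log2fc:.2f}  t={welch_t(d, c):.2f}")
```

Ranking by log2FC would put the noisy gene first; ranking by the t-statistic puts the consistent gene far ahead.
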
Count Matrix
Convenient

Upload your raw or normalized expression matrix, and Sig2Drug will perform differential expression analysis automatically. Convenient if you don't have access to R/Python DE pipelines, but for best results we recommend running your own DE analysis and submitting a Ranked Gene List.

Format Requirements

  • First column: Gene symbols (HGNC)
  • Remaining columns: Sample expression values
  • Names of control-sample columns must contain the string "control" (case-insensitive)
  • All other columns are treated as disease/condition samples
  • Supports: TSV, CSV, semicolon-separated, or Excel (.xlsx)
gene    Disease_1  Disease_2  Disease_3  Control_1  Control_2  Control_3
BRCA1   234.5      245.2      228.9      89.3       92.1       87.6
TP53    567.8      589.4      545.1      234.2      221.8      228.5
MYC     123.4      134.2      118.7      456.7      467.3      448.9
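
To check a header before uploading, the column-naming rule above can be sketched as follows. This is an illustration of the stated rule, not Sig2Drug's actual parser; the function name `split_columns` and the error condition are assumptions.

```python
def split_columns(header):
    """Split sample column names into (control, disease) by the 'control' substring."""
    samples = header[1:]  # the first column holds gene symbols
    control = [name for name in samples if "control" in name.lower()]
    disease = [name for name in samples if "control" not in name.lower()]
    if not control or not disease:
        # Assumption: a comparison needs at least one column on each side
        raise ValueError("need at least one control and one disease column")
    return control, disease

header = ["gene", "Disease_1", "Disease_2", "Disease_3",
          "Control_1", "CONTROL_2", "control_3"]
control_cols, disease_cols = split_columns(header)
print("control:", control_cols)
print("disease:", disease_cols)
```

Because every column without "control" in its name counts as disease, stray columns such as gene descriptions or lengths should be removed before upload.
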
What happens internally
Sig2Drug automatically detects raw count data and applies TMM normalization with limma-voom for proper differential expression analysis. For pre-normalized data (e.g., microarray, log-scale), standard limma is used directly. The moderated t-statistic is used for ranking.
Up/Down Gene Lists
Basic

Provide two separate lists: upregulated and downregulated genes. This is the simplest input format but provides less information for drug matching since ranking information is lost.

Format Requirements

  • Two separate text boxes: UP genes and DOWN genes
  • One gene symbol per line
  • Minimum 25 genes in each list
  • Order within each list doesn't matter
↑ Upregulated
S100A8
S100A9
CXCL8
IL1B
MMP9
...
↓ Downregulated
MUC2
CDH1
FABP1
TFF3
CA2
...
Limitations
Only KS and XSum methods can be calculated with this input type. RGES requires continuous statistics. Consider using Ranked List or Count Matrix for better results.
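
Before pasting the two lists, a quick local check of the format requirements can save a failed submission. The sketch below assumes the text-box contents are available as strings; the overlap check is an extra sanity test added here, not a documented Sig2Drug rule, and the `GENE1`..`GENE30` symbols are placeholders.

```python
MIN_GENES = 25  # minimum per list, from the format requirements above

def parse_gene_box(text):
    """Parse one text box: one symbol per line, blanks and repeats dropped."""
    genes = []
    for line in text.splitlines():
        symbol = line.strip()
        if symbol and symbol not in genes:
            genes.append(symbol)
    return genes

def check_gene_lists(up_text, down_text):
    """Return (up, down, problems); an empty problems list means the input looks usable."""
    up, down = parse_gene_box(up_text), parse_gene_box(down_text)
    problems = []
    for label, genes in (("UP", up), ("DOWN", down)):
        if len(genes) < MIN_GENES:
            problems.append(f"{label} list has {len(genes)} genes, need at least {MIN_GENES}")
    overlap = sorted(set(up) & set(down))
    if overlap:
        # Assumption: a gene should not appear as both up- and downregulated
        problems.append(f"genes in both lists: {', '.join(overlap)}")
    return up, down, problems

# Hypothetical input: 30 placeholder symbols up, a short DOWN list with one clash
up_text = "\n".join(f"GENE{i}" for i in range(1, 31))
down_text = "MUC2\nCDH1\n\nGENE5\n"
up, down, problems = check_gene_lists(up_text, down_text)
for message in problems:
    print(message)
```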

Statistics Recommendations

The choice of statistic significantly impacts the quality of drug reversal. Here's a comprehensive guide for different analysis pipelines:

Statistic | Source | Recommendation | Notes
t-statistic | limma, t-test | Best | Accounts for variance, robust to outliers
Wald statistic | DESeq2 | Best | Analogous to a t-statistic for RNA-seq counts (estimate divided by its standard error)
Moderated t | limma eBayes | Best | Borrows information across genes
Signed -log10(p) | Any DE tool | Good | sign(logFC) × -log10(p-value)
Wilcoxon W | Seurat, Scanpy | Good | Good for single-cell data
Z-score | edgeR, various | Good | Normalized test statistic
Log2 Fold Change | Any | Avoid | Doesn't account for variance or sample size

Why avoid log2 fold change?
Log2 fold change only measures the magnitude of change, not the reliability. A gene with log2FC = 5 but high variance is less reliable than a gene with log2FC = 2 and low variance. The t-statistic captures both effect size AND variance, giving you a more accurate ranking.

Bulk RNA-seq

For bulk RNA-seq data, we recommend using established DE pipelines. Here are examples for the most common tools:

DESeq2 (Recommended)

library(DESeq2)

# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~ condition)
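dds <- DESeq(dds)
# Name the contrast explicitly so positive values mean "up in disease"
# (assumes the condition levels are called "Disease" and "Control")
res <- results(dds, contrast = c("condition", "Disease", "Control"))

# Extract Wald statistic for Sig2Drug
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$stat  # Wald statistic
)
# Drop genes DESeq2 could not test (e.g., all-zero counts give NA)
sig2drug_input <- sig2drug_input[!is.na(sig2drug_input$statistic), ]
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]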

limma-voom

library(limma)
library(edgeR)

# Create DGEList and normalize
dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)

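# Design matrix; make Control the reference level so that coef = 2 is Disease vs Control
# (assumes metadata$condition holds "Control" and "Disease", as in the DESeq2 example)
metadata$condition <- relevel(factor(metadata$condition), ref = "Control")
design <- model.matrix(~ condition, data = metadata)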

# voom transformation and fit
v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)

# Extract moderated t-statistic for Sig2Drug
res <- topTable(fit, coef = 2, number = Inf, sort.by = "none")
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$t  # Moderated t-statistic
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]

edgeR

library(edgeR)

# Create DGEList and estimate dispersions
dge <- DGEList(counts = counts, group = condition)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge)

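# Exact test; pair = c(reference, test) so that logFC is Disease relative to Control
# (assumes the group levels are named "Control" and "Disease")
et <- exactTest(dge, pair = c("Control", "Disease"))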
res <- topTags(et, n = Inf)$table

# Create signed statistic for Sig2Drug
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = sign(res$logFC) * -log10(res$PValue)
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]

Single-Cell RNA-seq

Single-cell data requires special consideration because cells are not independent biological replicates: cells from the same sample share that sample's biology and processing. Treating each cell as a replicate in cell-level tests therefore inflates significance.

Pseudobulk is STRONGLY Preferred
Whenever you have multiple biological samples (e.g., multiple patients, multiple mice), aggregate cells into pseudobulk profiles per sample, then use bulk RNA-seq methods (DESeq2, limma). This accounts for true biological variation and avoids pseudoreplication.
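
Conceptually, pseudobulk aggregation is just a per-sample sum of raw counts. The sketch below shows the idea on a toy data structure (a list of `(sample_id, counts)` pairs, which is an assumption made for illustration); in practice the R examples that follow do this on the Seurat object.

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum raw counts over all cells that share a sample_id."""
    totals = defaultdict(lambda: defaultdict(int))
    for sample_id, counts in cells:
        for gene, count in counts.items():
            totals[sample_id][gene] += count
    return {sample: dict(genes) for sample, genes in totals.items()}

# Hypothetical cells: (sample_id, raw counts per gene)
cells = [
    ("patient_1", {"S100A8": 4, "MUC2": 0}),
    ("patient_1", {"S100A8": 6, "MUC2": 1}),
    ("control_1", {"S100A8": 1, "MUC2": 5}),
    ("control_1", {"S100A8": 0, "MUC2": 7}),
]
profiles = pseudobulk(cells)
for sample, genes in profiles.items():
    print(sample, genes)
```

Four cells collapse into two profiles, so the downstream test sees two observations rather than four, which is what prevents pseudoreplication.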

Option 1: Pseudobulk + DESeq2 (Best)

When to use: You have cells from multiple biological samples (≥3 disease + ≥3 control samples).

library(Seurat)
library(DESeq2)

# Aggregate cells by sample
pseudobulk <- AggregateExpression(
  seurat_obj,
  group.by = "sample_id",  # Your sample/patient column
  slot = "counts"
)$RNA

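# Create metadata (one row per sample), in the same order as the pseudobulk columns
sample_meta <- seurat_obj@meta.data[!duplicated(seurat_obj$sample_id),
                                    c("sample_id", "condition")]
rownames(sample_meta) <- sample_meta$sample_id
# Fails loudly if Seurat altered the sample names (e.g., "_" replaced by "-")
stopifnot(all(colnames(pseudobulk) %in% rownames(sample_meta)))
sample_meta <- sample_meta[colnames(pseudobulk), ]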

# Run DESeq2 on pseudobulk
dds <- DESeqDataSetFromMatrix(
  countData = pseudobulk,
  colData = sample_meta,
  design = ~ condition
)
dds <- DESeq(dds)
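# Explicit contrast: positive = up in disease (assumes levels "Disease" and "Control")
res <- results(dds, contrast = c("condition", "Disease", "Control"))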

# Extract Wald statistic for Sig2Drug
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$stat
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]

Option 2: Pseudobulk + limma-voom

When to use: Same as above. Alternative to DESeq2, especially with larger sample sizes.

library(Seurat)
library(limma)
library(edgeR)

# Aggregate cells by sample
pseudobulk <- AggregateExpression(
  seurat_obj,
  group.by = "sample_id",
  slot = "counts"
)$RNA

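# Create metadata (one row per sample), in the same order as the pseudobulk columns
sample_meta <- seurat_obj@meta.data[!duplicated(seurat_obj$sample_id),
                                    c("sample_id", "condition")]
rownames(sample_meta) <- sample_meta$sample_id
# Fails loudly if Seurat altered the sample names (e.g., "_" replaced by "-")
stopifnot(all(colnames(pseudobulk) %in% rownames(sample_meta)))
sample_meta <- sample_meta[colnames(pseudobulk), ]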

# TMM normalization + voom
dge <- DGEList(counts = pseudobulk)
dge <- calcNormFactors(dge)
design <- model.matrix(~ condition, data = sample_meta)
v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)

# Extract moderated t-statistic for Sig2Drug
res <- topTable(fit, coef = 2, number = Inf, sort.by = "none")
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$t
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]
What if I only have ONE sample per condition?
Without biological replicates, you cannot estimate true biological variance. However, you can still generate a useful signature using cell-level tests. Just be aware that:
  • P-values will be artificially small (many cells = inflated significance)
  • Results reflect technical/cellular variation, not biological
  • Use results as exploratory, not definitive

Option 3: Cell-Level Analysis (Single Sample)

When to use: You only have one sample per condition (e.g., one patient vs one control, or one treated vs one untreated sample). Use with caution.

library(Seurat)

# Cell-level DE (use when no biological replicates)
markers <- FindMarkers(
  seurat_obj,
  ident.1 = "Disease",
  ident.2 = "Control",
  test.use = "wilcox",  # Wilcoxon is robust
  logfc.threshold = 0,  # Keep all genes
  min.pct = 0.1         # Filter lowly expressed
)

# Create signed statistic
# Use sign(logFC) x -log10(p) to combine direction and significance
sig2drug_input <- data.frame(
  gene = rownames(markers),
  statistic = sign(markers$avg_log2FC) * -log10(markers$p_val + 1e-300)
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]

# With test.use = "t", build the statistic the same way from p_val and avg_log2FC.
# FindMarkers does not return the t-statistic itself; avg_diff is a difference in means, not t.

Scanpy (Python)

import scanpy as sc
import numpy as np

# Differential expression
sc.tl.rank_genes_groups(adata, groupby='condition', 
                         method='wilcoxon',  # or 't-test'
                         reference='Control')

# Extract results
result = sc.get.rank_genes_groups_df(adata, group='Disease')

# Create signed statistic for Sig2Drug
result['statistic'] = np.sign(result['logfoldchanges']) * -np.log10(result['pvals'] + 1e-300)
sig2drug_input = result[['names', 'statistic']].sort_values('statistic', ascending=False)

Single-Cell Method Comparison

Scenario | Recommended Approach | Statistic | Quality
Multiple samples per group (≥3) | Pseudobulk + DESeq2/limma-voom | Wald / t-statistic | Best
2 samples per group | Pseudobulk + limma-voom (careful) | Moderated t | OK
1 sample per group | Cell-level Wilcoxon/t-test | sign(logFC) × -log10(p) | Exploratory
MAST (any scenario) | Cell-level with covariates | sign(logFC) × -log10(p) | Good

Why not just use logFC from single-cell?
Log2 fold change alone doesn't account for variance. With thousands of cells, even tiny differences become "significant." The signed -log10(p) approach at least incorporates statistical testing, though significance remains inflated (p-values are too small) without true replicates.

Microarray

For microarray data (Affymetrix, Illumina, etc.), limma is the gold standard for differential expression analysis. Microarray data is typically already normalized (log2-scale), so no additional normalization (TMM, voom) is needed:

library(limma)
library(GEOquery)  # If downloading from GEO

# Example: Download and process GEO dataset
gse <- getGEO("GSE12345")[[1]]
expr <- exprs(gse)  # Expression matrix (already log2-normalized)

# Design matrix
condition <- factor(c(rep("Control", 5), rep("Disease", 5)))
design <- model.matrix(~ condition)

# Fit model (no voom needed - data is already continuous/normalized)
fit <- lmFit(expr, design)
fit <- eBayes(fit)

# Get results
res <- topTable(fit, coef = 2, number = Inf)

# Map probe IDs to gene symbols (platform-specific) and store them in
# res$gene_symbol; topTable() does not create this column for you.
# Then extract t-statistic for Sig2Drug
sig2drug_input <- data.frame(
  gene = res$gene_symbol,
  statistic = res$t  # Moderated t-statistic
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]
Probe-to-gene mapping
If multiple probes map to the same gene, keep the probe with the highest absolute t-statistic, or average the values. Most annotation packages (hgu133plus2.db, illuminaHumanv4.db, etc.) provide probe-to-symbol mappings.
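
The first strategy (keep the probe with the highest absolute t-statistic) can be sketched as follows. The probe IDs and values are hypothetical, and skipping probes without a symbol is an assumption made here.

```python
def collapse_probes(rows):
    """Keep, for each gene, the probe with the largest absolute t-statistic."""
    best = {}
    for probe_id, gene, t_stat in rows:
        if not gene:  # assumption: probes without a gene symbol are dropped
            continue
        if gene not in best or abs(t_stat) > abs(best[gene][1]):
            best[gene] = (probe_id, t_stat)
    # Return (gene, statistic) pairs ranked from most up to most down
    return sorted(
        ((gene, t) for gene, (_, t) in best.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Hypothetical (probe_id, gene_symbol, moderated t) rows
rows = [
    ("probe_a", "TP53", 2.1),
    ("probe_b", "TP53", -4.8),  # larger |t|, so this probe represents TP53
    ("probe_c", "MYC", 3.3),
    ("probe_d", "", 9.9),       # unannotated probe
]
collapsed = collapse_probes(rows)
print(collapsed)
```

Note that the sign is preserved: TP53 is represented by its downregulated probe because that probe carries the stronger evidence.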

Scoring Methods

Sig2Drug offers multiple scoring methods to evaluate drug-disease signature matching. Each method captures different aspects of signature reversal:

Recommended: XSum + RGES (Combo)
The default "combo" option runs both XSum and RGES, providing complementary views of signature reversal. Use this unless you have a specific reason to choose a single method.

eXtreme Methods (X-methods)

These methods focus on the most differentially expressed genes (top up and top down) rather than using all genes. This makes them robust to noise in weakly expressed genes.

XSum - eXtreme Sum Score

Measures the difference in mean z-scores between disease-upregulated and disease-downregulated genes in the drug signature.

Formula: XSum = mean(z_up) − mean(z_down)

Interpretation: Lower (more negative) = better reversal. A drug that downregulates disease-UP genes and upregulates disease-DOWN genes will have a negative XSum.
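
A worked example of the formula, as a sketch rather than Sig2Drug's implementation: `drug_z` is a hypothetical mapping from gene to drug-signature z-score, and genes missing from the drug signature are simply skipped (an assumption).

```python
from statistics import mean

def xsum(drug_z, up_genes, down_genes):
    """mean(drug z of disease-UP genes) - mean(drug z of disease-DOWN genes)."""
    up = [drug_z[g] for g in up_genes if g in drug_z]
    down = [drug_z[g] for g in down_genes if g in drug_z]
    if not up or not down:
        raise ValueError("no overlap between a gene list and the drug signature")
    return mean(up) - mean(down)

# Hypothetical drug signature that pushes disease-UP genes down and vice versa
drug_z = {"S100A8": -2.0, "S100A9": -1.0, "MUC2": 1.5, "CDH1": 0.5}
score = xsum(drug_z, up_genes=["S100A8", "S100A9"], down_genes=["MUC2", "CDH1"])
print(score)  # (-1.5) - (1.0) = -2.5
```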

XCos - eXtreme Cosine Similarity

Computes the cosine similarity between the disease signature and drug signature for extreme genes.

Formula: XCos = cos(θ) between disease and drug vectors

Interpretation: Ranges from -1 (perfect reversal) to +1 (same direction). Values close to -1 indicate strong reversal potential.
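
As an illustration of the cosine calculation, the sketch below assumes both signatures are dictionaries already restricted to the extreme genes; how those genes are selected is not specified here and is left outside the function.

```python
import math

def cosine(disease, drug):
    """Cosine similarity over the genes present in both signatures."""
    genes = sorted(set(disease) & set(drug))
    a = [disease[g] for g in genes]
    b = [drug[g] for g in genes]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("one signature is all zeros on the shared genes")
    return dot / norm

# Hypothetical values: the drug changes are the disease changes scaled by -0.5
disease = {"S100A8": 8.9, "CXCL8": 6.7, "MUC2": -5.3, "FABP1": -7.4}
drug = {gene: -0.5 * value for gene, value in disease.items()}
print(round(cosine(disease, drug), 6))
```

Because cosine similarity ignores vector length, a drug that reverses every gene at half strength still scores as a perfect reversal.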

XCor - eXtreme Pearson Correlation

Pearson correlation coefficient between disease and drug expression changes for extreme genes.

Formula: XCor = Pearson(disease_signature, drug_signature)

Interpretation: Ranges from -1 to +1. Negative correlation indicates reversal. More sensitive to linear relationships.

XSpe - eXtreme Spearman Correlation

Spearman rank correlation between disease and drug signatures for extreme genes.

Formula: XSpe = Spearman(disease_signature, drug_signature)

Interpretation: Ranges from -1 to +1. More robust to outliers than Pearson. Captures monotonic relationships.
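
The difference between the two correlations shows up when one gene has an outlying drug response. The sketch below implements Pearson directly and Spearman as Pearson on average ranks; the input vectors are made-up values, assumed to be aligned gene by gene over the extreme genes.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical aligned values; the drug reverses the order, with one outlier (-30)
disease = [8.9, 6.7, 2.0, -5.3, -7.4]
drug = [-30.0, -2.0, -1.0, 1.5, 2.0]
print(f"Pearson:  {pearson(disease, drug):.2f}")
print(f"Spearman: {spearman(disease, drug):.2f}")
```

The rank order is perfectly reversed, so Spearman reaches its minimum, while the outlier pulls Pearson away from -1.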

Other Methods

KS - Kolmogorov-Smirnov Score

Non-parametric enrichment score based on the original Connectivity Map (CMap) methodology. Tests whether up/down gene sets are enriched at opposite ends of the drug-ranked gene list.

Key feature: Works with Up/Down gene lists only - no statistics required. Useful when you only have gene lists without quantitative values.

Interpretation: Negative scores indicate reversal. Based on enrichment statistics rather than direct correlation.
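
For orientation, here is a sketch of the KS-based connectivity score as formulated in the original Connectivity Map. Sig2Drug's exact implementation may differ; the gene names `G1`..`G10` are placeholders, and the drug list is assumed to be ranked from most upregulated to most downregulated.

```python
def ks_score(ranked_genes, gene_set):
    """Signed KS enrichment of gene_set within ranked_genes."""
    n = len(ranked_genes)
    positions = [i + 1 for i, g in enumerate(ranked_genes) if g in gene_set]
    t = len(positions)
    if t == 0:
        return 0.0
    a = max(j / t - v / n for j, v in enumerate(positions, start=1))
    b = max(v / n - (j - 1) / t for j, v in enumerate(positions, start=1))
    return a if a > b else -b

def connectivity(ranked_genes, up_genes, down_genes):
    """ks_up - ks_down, or 0 when both sets are enriched at the same end."""
    ks_up = ks_score(ranked_genes, set(up_genes))
    ks_down = ks_score(ranked_genes, set(down_genes))
    if ks_up * ks_down > 0:  # same sign: no consistent direction
        return 0.0
    return ks_up - ks_down

# Hypothetical drug ranking: G1 most upregulated ... G10 most downregulated
ranked = [f"G{i}" for i in range(1, 11)]
# Disease-UP genes sit at the bottom of the drug list, disease-DOWN at the top
score = connectivity(ranked, up_genes=["G9", "G10"], down_genes=["G1", "G2"])
print(round(score, 2))
```

Swapping the two gene sets flips the sign, which corresponds to a drug that mimics rather than reverses the disease signature.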

RGES - Reverse Gene Expression Score

A weighted correlation-based score that measures overall signature reversal across all matched genes, with weights based on gene importance.

Key feature: Uses all matched genes, not just extremes. Provides a more global view of reversal.

Interpretation: Ranges from -1 (perfect reversal) to +1 (same direction). RGES < -0.3 suggests meaningful reversal potential.

Method Comparison

Method | Uses | Best for | Requires statistics?
XSum | Extreme genes | General use, robust | Yes (ranked list or matrix)
XCos | Extreme genes | Angular similarity | Yes
XCor | Extreme genes | Linear relationships | Yes
XSpe | Extreme genes | Robust to outliers | Yes
KS | Gene sets | Only have gene lists | No (up/down lists OK)
RGES | All genes | Global reversal view | Yes

Interpreting Results

Lower scores indicate better drug candidates for reversing the disease signature:

XSum Score | Interpretation | Action
< -1.5 | Excellent reversal | Strong candidate for validation
-1.5 to -0.8 | Good reversal | Worth investigating
-0.8 to -0.3 | Moderate reversal | May have partial effect
> -0.3 | Weak/no reversal | Unlikely to reverse signature
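
When triaging many drugs, the table can be applied programmatically. The sketch below is one way to encode it; how scores exactly on a boundary (-1.5, -0.8, -0.3) are assigned is an assumption, since the table does not say.

```python
def interpret_xsum(score):
    """Map an XSum score to the interpretation bands in the table above."""
    if score < -1.5:
        return "Excellent reversal"
    if score < -0.8:   # assumption: -1.5 itself counts as "Good"
        return "Good reversal"
    if score < -0.3:   # assumption: -0.8 itself counts as "Moderate"
        return "Moderate reversal"
    return "Weak/no reversal"

# Hypothetical scores for illustration
for score in (-2.1, -1.0, -0.5, 0.4):
    print(f"{score:+.1f}\t{interpret_xsum(score)}")
```
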
Validation is essential
Sig2Drug generates computational hypotheses that require experimental validation. Consider the drug's known mechanism of action, safety profile, and relevance to your disease context before proceeding to validation studies.