Technical Documentation
Learn how to prepare your gene expression data for drug repurposing analysis. This guide covers all supported input formats, recommended statistics, and best practices for different data types.
Overview
Sig2Drug identifies potential therapeutic compounds by comparing your disease gene expression signature against an integrated database of drug-induced expression profiles (Tahoe-100M, SciPlex, PANACEA, MixSeq, DRUG-seq, DrugReflector, and others). The goal is to find drugs that can reverse the disease signature: upregulating genes that are downregulated in disease, and vice versa.
The quality of your results depends heavily on the quality of your input signature. This documentation will help you prepare optimal input data from various experimental sources.
Quick Start
Choose your input type based on what data you have available:
| You have... | Use this input | Quality |
|---|---|---|
| DESeq2/limma/edgeR results with statistics | Ranked Gene List | Best |
| Single-cell DE results (Wilcoxon, t-test, MAST) | Ranked Gene List | Best |
| Raw expression matrix (genes × samples) | Count Matrix | Convenient |
| Only gene lists (up/down regulated) | Up/Down Genes | Basic |
Input Types Comparison
Ranked Gene List (Recommended)
Provide pre-computed differential expression results as a ranked list of genes with associated statistics. This is the recommended option because you control the DE analysis pipeline and can use the most appropriate method for your data type.
Format Requirements
- Two columns: gene symbol and statistic
- Positive values = upregulated in disease
- Negative values = downregulated in disease
- Supports: TSV, CSV, semicolon-separated, or Excel (.xlsx)
What is the "statistic" column?
The statistic is a numerical value that captures both the direction and confidence of differential expression for each gene. It tells Sig2Drug not just whether a gene is up or down, but how strongly and reliably it changed. The sign indicates direction (positive = upregulated in disease, negative = downregulated), and the magnitude indicates confidence.
| Your pipeline | What to use as statistic | How to extract it |
|---|---|---|
| DESeq2 | Wald statistic | res$stat |
| limma / limma-voom | Moderated t-statistic | topTable(fit)$t |
| edgeR | Signed −log10(p) | sign(logFC) * -log10(PValue) |
| Seurat (Wilcoxon) | Signed −log10(p) | sign(avg_log2FC) * -log10(p_val) |
| Seurat (t-test) | Signed −log10(p) | sign(avg_log2FC) * -log10(p_val) (FindMarkers reports p-values, not the t-statistic itself) |
| Scanpy | Signed −log10(p) | sign(logfoldchanges) * -log10(pvals) |
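The signed −log10(p) statistic used in several rows above can be computed from any table of fold changes and p-values. The following is a minimal sketch in plain Python; the gene names and values are made up for illustration, and the `floor` parameter is an assumption added here to avoid taking the logarithm of zero.

```python
import math

def signed_neg_log10_p(log_fc, p_value, floor=1e-300):
    """Return sign(log_fc) * -log10(p_value), clamping p to avoid log10(0)."""
    p = max(p_value, floor)
    direction = (log_fc > 0) - (log_fc < 0)  # +1, 0, or -1
    return direction * -math.log10(p)

# Hypothetical DE results: (gene, log fold change, p-value)
de_results = [
    ("GENE_A", 2.1, 1e-8),
    ("GENE_B", -1.4, 1e-5),
    ("GENE_C", 0.2, 0.5),
]

# Rank from most upregulated to most downregulated in disease
ranked = sorted(
    ((gene, signed_neg_log10_p(lfc, p)) for gene, lfc, p in de_results),
    key=lambda row: row[1],
    reverse=True,
)

for gene, stat in ranked:
    print(f"{gene}\t{stat:.3f}")
```

Strongly significant upregulated genes end up at the top of the list, strongly significant downregulated genes at the bottom, and weakly supported genes near zero.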
Count Matrix
Upload your raw or normalized expression matrix, and Sig2Drug will perform differential expression analysis automatically. This is convenient if you don't have access to R/Python DE pipelines, but for best results we recommend running your own DE analysis and submitting a Ranked Gene List.
Format Requirements
- First column: Gene symbols (HGNC)
- Remaining columns: Sample expression values
- Column names must contain "control" for control samples (case-insensitive)
- All other columns are treated as disease/condition samples
- Supports: TSV, CSV, semicolon-separated, or Excel (.xlsx)
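To check how the columns of a matrix would be grouped before uploading, the naming rule above can be sketched as follows. This is one reading of the rule (a case-insensitive substring match on "control"), not Sig2Drug's actual parser, and the header shown is hypothetical.

```python
def split_samples(column_names):
    """Split a matrix header into control and disease columns.

    The first column is assumed to hold gene symbols and is skipped.
    """
    sample_columns = column_names[1:]
    control = [c for c in sample_columns if "control" in c.lower()]
    disease = [c for c in sample_columns if "control" not in c.lower()]
    return control, disease

# Hypothetical header row of an expression matrix
header = ["gene", "Control_1", "CONTROL_2", "patient_1", "patient_2"]
control_cols, disease_cols = split_samples(header)

print("control:", control_cols)
print("disease:", disease_cols)
```

If either list comes back empty, the column names probably need to be edited before upload.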
Up/Down Genes
Provide two separate lists: upregulated and downregulated genes. This is the simplest input format, but it gives Sig2Drug less information for drug matching because the ranking information is lost.
Format Requirements
- Two separate text boxes: UP genes and DOWN genes
- One gene symbol per line
- Minimum 25 genes in each list
- Order within each list doesn't matter
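A quick pre-submission check of the two lists might look like the sketch below. The 25-gene minimum comes from the requirements above; the duplicate removal and the overlap check are extra sanity checks assumed here, not documented Sig2Drug behavior.

```python
MIN_GENES = 25  # minimum per list, as stated in the format requirements

def parse_gene_box(text):
    """Parse one gene symbol per line, dropping blanks and duplicates."""
    genes, seen = [], set()
    for line in text.splitlines():
        symbol = line.strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            genes.append(symbol)
    return genes

def check_up_down(up_text, down_text):
    """Return the parsed lists plus a list of human-readable problems."""
    up, down = parse_gene_box(up_text), parse_gene_box(down_text)
    problems = []
    if len(up) < MIN_GENES:
        problems.append(f"UP list has {len(up)} genes, need {MIN_GENES}")
    if len(down) < MIN_GENES:
        problems.append(f"DOWN list has {len(down)} genes, need {MIN_GENES}")
    overlap = sorted(set(up) & set(down))
    if overlap:
        problems.append(f"{len(overlap)} genes appear in both lists")
    return up, down, problems

# Hypothetical input: 30 UP genes but only 10 DOWN genes
up_text = "\n".join(f"UP{i}" for i in range(30))
down_text = "\n".join(f"DOWN{i}" for i in range(10))
up, down, problems = check_up_down(up_text, down_text)
print(problems)
```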
Statistics Recommendations
The choice of statistic significantly affects the quality of drug matching. The table below rates common statistics from different analysis pipelines:
| Statistic | Source | Recommendation | Notes |
|---|---|---|---|
| t-statistic | limma, t-test | Best | Accounts for variance and sample size |
| Wald statistic | DESeq2 | Best | Analogous to a t-statistic for RNA-seq (log2 fold change divided by its standard error) |
| Moderated t | limma eBayes | Best | Borrows information across genes |
| Signed -log10(p) | Any DE tool | Good | sign(logFC) × -log10(p-value) |
| Wilcoxon W | Seurat, Scanpy | Good | Good for single-cell data |
| Z-score | edgeR, various | Good | Normalized test statistic |
| Log2 Fold Change | Any | Avoid | Doesn't account for variance or sample size |
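The reason fold change alone is rated "Avoid" can be seen in a small worked example. The sketch below uses an ordinary Welch t-statistic (not limma's moderated t) on made-up log2 expression values: both genes have the same log2 fold change, but only one of them changes consistently across samples.

```python
import math
from statistics import mean, variance

def welch_t(disease, control):
    """Welch t-statistic: difference in means divided by its standard error."""
    se = math.sqrt(variance(disease) / len(disease)
                   + variance(control) / len(control))
    return (mean(disease) - mean(control)) / se

# Made-up log2 expression values for two genes
stable_control = [5.0, 5.1, 4.9, 5.0]
stable_disease = [6.0, 6.1, 5.9, 6.0]
noisy_control = [3.0, 7.0, 4.0, 6.0]
noisy_disease = [4.0, 8.0, 5.0, 7.0]

fc_stable = mean(stable_disease) - mean(stable_control)
fc_noisy = mean(noisy_disease) - mean(noisy_control)
t_stable = welch_t(stable_disease, stable_control)
t_noisy = welch_t(noisy_disease, noisy_control)

print(f"stable gene: log2FC={fc_stable:.2f}, t={t_stable:.2f}")
print(f"noisy gene:  log2FC={fc_noisy:.2f}, t={t_noisy:.2f}")
```

Ranking by fold change would place these two genes side by side, while a variance-aware statistic puts the consistent gene far ahead of the noisy one.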
Bulk RNA-seq
For bulk RNA-seq data, we recommend using established DE pipelines. Here are examples for the most common tools:
DESeq2 (Recommended)
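library(DESeq2)
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~ condition)
# Make the control group the reference level so positive = up in disease
# (assumes the label is "Control"; adjust to your metadata)
dds$condition <- relevel(factor(dds$condition), ref = "Control")
dds <- DESeq(dds)
res <- results(dds)
# Extract Wald statistic for Sig2Drug
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$stat  # Wald statistic
)
# Drop genes without a statistic (e.g., all-zero counts give NA)
sig2drug_input <- sig2drug_input[!is.na(sig2drug_input$statistic), ]
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]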
limma-voom
library(limma)
library(edgeR)
# Create DGEList and normalize
dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)
# Design matrix (condition is assumed to be a column of your sample metadata)
design <- model.matrix(~ condition, data = metadata)
# voom transformation and fit
v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)
# Extract moderated t-statistic for Sig2Drug
# (coef = 2 is the disease-vs-control term when control is the reference level)
res <- topTable(fit, coef = 2, number = Inf, sort.by = "none")
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$t  # Moderated t-statistic
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]
edgeR
library(edgeR)
# Create DGEList and estimate dispersions
dge <- DGEList(counts = counts, group = condition)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge)
# Exact test or GLM
et <- exactTest(dge)
res <- topTags(et, n = Inf)$table
# Create signed statistic for Sig2Drug (floor p-values to avoid -log10(0))
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = sign(res$logFC) * -log10(pmax(res$PValue, 1e-300))
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]
Single-Cell RNA-seq
Single-cell data requires special consideration because cells are not independent biological replicates - they come from the same samples. Using cell-level statistics can lead to inflated significance.
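The pseudobulk idea used in the options below is simply to sum the counts of all cells that belong to the same biological sample, so that the sample (not the cell) becomes the unit of replication. A minimal illustration in plain Python, with hypothetical cell barcodes, sample labels, and counts:

```python
from collections import defaultdict

def pseudobulk(cell_counts, cell_to_sample):
    """Sum per-cell gene counts into one count profile per sample."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell, gene_counts in cell_counts.items():
        sample = cell_to_sample[cell]
        for gene, count in gene_counts.items():
            totals[sample][gene] += count
    return {sample: dict(genes) for sample, genes in totals.items()}

# Hypothetical data: four cells from two samples
cell_counts = {
    "cell1": {"GENE_A": 3, "GENE_B": 0},
    "cell2": {"GENE_A": 1, "GENE_B": 2},
    "cell3": {"GENE_A": 0, "GENE_B": 5},
    "cell4": {"GENE_A": 2, "GENE_B": 1},
}
cell_to_sample = {
    "cell1": "patient_1", "cell2": "patient_1",
    "cell3": "control_1", "cell4": "control_1",
}

bulk = pseudobulk(cell_counts, cell_to_sample)
print(bulk)
```

Four cells collapse into two samples here, which is why a pseudobulk test reports far more modest significance than a cell-level test on the same data.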
Option 1: Pseudobulk + DESeq2 (Best)
When to use: You have cells from multiple biological samples (≥3 disease + ≥3 control samples).
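library(Seurat)
library(DESeq2)
# Aggregate cells by sample (summed counts per sample)
pseudobulk <- AggregateExpression(
  seurat_obj,
  group.by = "sample_id",  # Your sample/patient column
  slot = "counts"          # Argument name in Seurat v4; newer versions may not need it
)$RNA
pseudobulk <- as.matrix(pseudobulk)
# Create metadata (one row per sample), in the same order as the count columns
sample_meta <- seurat_obj@meta.data[!duplicated(seurat_obj$sample_id),
                                    c("sample_id", "condition")]
rownames(sample_meta) <- sample_meta$sample_id
# Stops if Seurat altered the sample names during aggregation; rename to match if so
stopifnot(all(colnames(pseudobulk) %in% rownames(sample_meta)))
sample_meta <- sample_meta[colnames(pseudobulk), ]
# Control as reference level so positive = up in disease (adjust the label if needed)
sample_meta$condition <- relevel(factor(sample_meta$condition), ref = "Control")
# Run DESeq2 on pseudobulk
dds <- DESeqDataSetFromMatrix(
  countData = pseudobulk,
  colData = sample_meta,
  design = ~ condition
)
dds <- DESeq(dds)
res <- results(dds)
# Extract Wald statistic for Sig2Drug, dropping genes without a statistic
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$stat
)
sig2drug_input <- sig2drug_input[!is.na(sig2drug_input$statistic), ]
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]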
Option 2: Pseudobulk + limma-voom
When to use: Same as above. Alternative to DESeq2, especially with larger sample sizes.
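library(Seurat)
library(limma)
library(edgeR)
# Aggregate cells by sample (summed counts per sample)
pseudobulk <- AggregateExpression(
  seurat_obj,
  group.by = "sample_id",
  slot = "counts"  # Argument name in Seurat v4; newer versions may not need it
)$RNA
pseudobulk <- as.matrix(pseudobulk)
# Create metadata (one row per sample), in the same order as the count columns
sample_meta <- seurat_obj@meta.data[!duplicated(seurat_obj$sample_id),
                                    c("sample_id", "condition")]
rownames(sample_meta) <- sample_meta$sample_id
# Stops if Seurat altered the sample names during aggregation; rename to match if so
stopifnot(all(colnames(pseudobulk) %in% rownames(sample_meta)))
sample_meta <- sample_meta[colnames(pseudobulk), ]
# Control as reference level so positive = up in disease (adjust the label if needed)
sample_meta$condition <- relevel(factor(sample_meta$condition), ref = "Control")
# TMM normalization + voom
dge <- DGEList(counts = pseudobulk)
dge <- calcNormFactors(dge)
design <- model.matrix(~ condition, data = sample_meta)
v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)
# Extract moderated t-statistic for Sig2Drug
res <- topTable(fit, coef = 2, number = Inf, sort.by = "none")
sig2drug_input <- data.frame(
  gene = rownames(res),
  statistic = res$t
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]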
Option 3: Cell-Level Analysis (Single Sample)
When to use: You only have one sample per condition (e.g., one patient vs one control, or one treated vs one untreated sample). Use with caution, because cell-level tests treat individual cells as replicates:
- P-values will be inflated (many cells = artificial power)
- Results reflect technical/cellular variation, not biological variation
- Treat results as exploratory, not definitive
library(Seurat)
# Cell-level DE (use when no biological replicates)
markers <- FindMarkers(
  seurat_obj,
  group.by = "condition",  # Metadata column holding the labels below
  ident.1 = "Disease",
  ident.2 = "Control",
  test.use = "wilcox",     # Wilcoxon rank-sum test
  logfc.threshold = 0,     # Keep all genes
  min.pct = 0.1            # Filter lowly expressed
)
# Create signed statistic
# Use sign(logFC) x -log10(p) to combine direction and significance
sig2drug_input <- data.frame(
  gene = rownames(markers),
  statistic = sign(markers$avg_log2FC) * -log10(markers$p_val + 1e-300)
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]
# With test.use = "t" the output has the same p_val and avg_log2FC columns,
# so the same signed -log10(p) formula applies. FindMarkers does not return
# the t-statistic itself.
Scanpy (Python)
import scanpy as sc
import numpy as np
# Differential expression
sc.tl.rank_genes_groups(adata, groupby='condition',
                        method='wilcoxon',  # or 't-test'
                        reference='Control')
# Extract results
result = sc.get.rank_genes_groups_df(adata, group='Disease')
# Create signed statistic for Sig2Drug
result['statistic'] = np.sign(result['logfoldchanges']) * -np.log10(result['pvals'] + 1e-300)
sig2drug_input = result[['names', 'statistic']].sort_values('statistic', ascending=False)
Single-Cell Method Comparison
| Scenario | Recommended Approach | Statistic | Quality |
|---|---|---|---|
| Multiple samples per group (≥3) | Pseudobulk + DESeq2/limma-voom | Wald / t-statistic | Best |
| 2 samples per group | Pseudobulk + limma-voom (careful) | Moderated t | OK |
| 1 sample per group | Cell-level Wilcoxon/t-test | sign(logFC) × -log10(p) | Exploratory |
| MAST (any scenario) | Cell-level with covariates | sign(logFC) × -log10(p) | Good |
Microarray
For microarray data (Affymetrix, Illumina, etc.), limma is the gold standard for differential expression analysis. Microarray data is typically already normalized (log2-scale), so no additional normalization (TMM, voom) is needed:
library(limma)
library(GEOquery)  # If downloading from GEO
# Example: Download and process GEO dataset
gse <- getGEO("GSE12345")[[1]]
expr <- exprs(gse)  # Expression matrix (check that values are on a log2 scale)
# Design matrix (example grouping; derive it from pData(gse) for real data)
condition <- factor(c(rep("Control", 5), rep("Disease", 5)))
design <- model.matrix(~ condition)
# Fit model (no voom needed - data is already continuous/normalized)
fit <- lmFit(expr, design)
fit <- eBayes(fit)
# Get results
res <- topTable(fit, coef = 2, number = Inf)
# Map probe IDs to gene symbols (platform-specific).
# The annotation column name is an assumption; check colnames(fData(gse)).
res$gene_symbol <- fData(gse)[rownames(res), "Gene Symbol"]
res <- res[!is.na(res$gene_symbol) & res$gene_symbol != "", ]
# Then extract t-statistic for Sig2Drug
sig2drug_input <- data.frame(
  gene = res$gene_symbol,
  statistic = res$t  # Moderated t-statistic
)
sig2drug_input <- sig2drug_input[order(-sig2drug_input$statistic), ]
Scoring Methods
Sig2Drug offers multiple scoring methods to evaluate drug-disease signature matching. Each method captures different aspects of signature reversal:
eXtreme Methods (X-methods)
These methods focus on the most differentially expressed genes (top up and top down) rather than using all genes. This makes them robust to noise from genes whose expression changes only weakly.
XSum - eXtreme Sum Score
Measures the difference in mean z-scores between disease-upregulated and disease-downregulated genes in the drug signature.
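Formula: XSum = mean(z_up) − mean(z_down), where z_up and z_down are the drug z-scores of the disease-upregulated and disease-downregulated genes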
Interpretation: Lower (more negative) = better reversal. A drug that downregulates disease-UP genes and upregulates disease-DOWN genes will have a negative XSum.
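The XSum formula can be sketched in a few lines. This follows the description above rather than Sig2Drug's internal code: how the extreme gene sets are chosen and how unmatched genes are handled are assumptions, and the gene names and z-scores are invented.

```python
from statistics import mean

def xsum(disease_up, disease_down, drug_z):
    """Mean drug z-score of disease-UP genes minus that of disease-DOWN genes."""
    up = [drug_z[g] for g in disease_up if g in drug_z]
    down = [drug_z[g] for g in disease_down if g in drug_z]
    if not up or not down:
        raise ValueError("no overlap between disease genes and drug profile")
    return mean(up) - mean(down)

# Hypothetical drug z-scores: the drug lowers A and B and raises C and D
drug_z = {"A": -2.0, "B": -1.0, "C": 1.5, "D": 2.5, "E": 0.1}
disease_up = ["A", "B"]    # upregulated in disease
disease_down = ["C", "D"]  # downregulated in disease

score = xsum(disease_up, disease_down, drug_z)
print(score)
```

Because this drug pushes the disease-UP genes down and the disease-DOWN genes up, the score is negative, which matches the interpretation above.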
XCos - eXtreme Cosine Similarity
Computes the cosine similarity between the disease signature and drug signature for extreme genes.
Formula: XCos = cos(θ) between disease and drug vectors
Interpretation: Ranges from -1 (perfect reversal) to +1 (same direction). Values close to -1 indicate strong reversal potential.
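As an illustration of the cosine score, the sketch below picks the `n_top` most upregulated and `n_top` most downregulated disease genes and compares them with the drug profile. The selection rule and the `n_top` parameter are assumptions made for this example; the number of extreme genes Sig2Drug actually uses is not specified here.

```python
import math

def extreme_genes(disease, n_top):
    """Return the n_top lowest and n_top highest genes by disease statistic."""
    if n_top < 1:
        raise ValueError("n_top must be at least 1")
    ranked = sorted(disease, key=disease.get)
    # dict.fromkeys removes duplicates when the two ends overlap
    return list(dict.fromkeys(ranked[:n_top] + ranked[-n_top:]))

def xcos(disease, drug, n_top):
    """Cosine similarity between disease and drug values on extreme genes."""
    genes = [g for g in extreme_genes(disease, n_top) if g in drug]
    dot = sum(disease[g] * drug[g] for g in genes)
    norm = (math.sqrt(sum(disease[g] ** 2 for g in genes))
            * math.sqrt(sum(drug[g] ** 2 for g in genes)))
    if norm == 0:
        raise ValueError("zero-length vector; cosine is undefined")
    return dot / norm

# Hypothetical signatures: the drug is an exact scaled mirror image on extreme genes
disease = {"A": 3.0, "B": 2.0, "C": 0.1, "D": -2.0, "E": -3.0}
drug = {"A": -1.5, "B": -1.0, "C": 0.4, "D": 1.0, "E": 1.5}

score = xcos(disease, drug, n_top=2)
print(round(score, 6))
```

Gene C is ignored because it is not among the extremes, so the score depends only on the four strongly changed genes.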
XCor - eXtreme Pearson Correlation
Pearson correlation coefficient between disease and drug expression changes for extreme genes.
Formula: XCor = Pearson(disease_signature, drug_signature)
Interpretation: Ranges from -1 to +1. Negative correlation indicates reversal. More sensitive to linear relationships.
XSpe - eXtreme Spearman Correlation
Spearman rank correlation between disease and drug signatures for extreme genes.
Formula: XSpe = Spearman(disease_signature, drug_signature)
Interpretation: Ranges from -1 to +1. More robust to outliers than Pearson. Captures monotonic relationships.
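The difference between the Pearson-based and Spearman-based scores shows up when one gene has an outlying value. The sketch below implements both correlations in plain Python and assumes the inputs have already been restricted to the extreme genes; the values are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical extreme-gene values; the first drug value is an outlier
disease_sig = [3.0, 2.0, -2.0, -3.0]
drug_sig = [-10.0, -1.0, 0.5, 1.0]

xcor = pearson(disease_sig, drug_sig)
xspe = spearman(disease_sig, drug_sig)
print(f"XCor={xcor:.3f}  XSpe={xspe:.3f}")
```

The drug reverses the ordering of every gene, so the rank-based score reaches its minimum, while the outlier pulls the Pearson-based score away from it.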
Other Methods
KS - Kolmogorov-Smirnov Score
Non-parametric enrichment score based on the original Connectivity Map (CMap) methodology. Tests whether up/down gene sets are enriched at opposite ends of the drug-ranked gene list.
Key feature: Works with Up/Down gene lists only - no statistics required. Useful when you only have gene lists without quantitative values.
Interpretation: Negative scores indicate reversal. Based on enrichment statistics rather than direct correlation.
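For orientation, the sketch below implements the classic CMap-style KS connectivity score on a drug profile ordered from most upregulated to most downregulated. It illustrates the general technique only; Sig2Drug's exact variant (normalization, weighting, tie handling) may differ, and the gene names are placeholders.

```python
def ks_score(gene_set, ranked_genes):
    """Signed KS enrichment of gene_set in ranked_genes (positive = near the top)."""
    n = len(ranked_genes)
    position = {g: i + 1 for i, g in enumerate(ranked_genes)}
    hits = sorted(position[g] for g in gene_set if g in position)
    t = len(hits)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - v / n for j, v in enumerate(hits))
    b = max(v / n - j / t for j, v in enumerate(hits))
    return a if a > b else -b

def connectivity(up_genes, down_genes, ranked_genes):
    """KS(up) - KS(down), or 0 when both sets are enriched at the same end."""
    ks_up = ks_score(up_genes, ranked_genes)
    ks_down = ks_score(down_genes, ranked_genes)
    if ks_up * ks_down > 0:
        return 0.0
    return ks_up - ks_down

# Hypothetical drug profile, most upregulated gene first
drug_ranked = [f"G{i}" for i in range(1, 11)]
disease_up = ["G9", "G10"]   # the drug pushes these down
disease_down = ["G1", "G2"]  # the drug pushes these up

score = connectivity(disease_up, disease_down, drug_ranked)
print(round(score, 3))
```

Only the membership of the two gene sets is needed on the disease side, which is why this method works with plain up/down lists.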
RGES - Reverse Gene Expression Score
A weighted correlation-based score that measures overall signature reversal across all matched genes, with weights based on gene importance.
Key feature: Uses all matched genes, not just extremes. Provides a more global view of reversal.
Interpretation: Ranges from -1 (perfect reversal) to +1 (same direction). RGES < -0.3 suggests meaningful reversal potential.
Method Comparison
| Method | Uses | Best for | Requires statistics? |
|---|---|---|---|
| XSum | Extreme genes | General use, robust | Yes (ranked list or matrix) |
| XCos | Extreme genes | Angular similarity | Yes |
| XCor | Extreme genes | Linear relationships | Yes |
| XSpe | Extreme genes | Robust to outliers | Yes |
| KS | Gene sets | Only have gene lists | No (up/down lists OK) |
| RGES | All genes | Global reversal view | Yes |
Interpreting Results
Lower scores indicate better drug candidates for reversing the disease signature:
| XSum Score | Interpretation | Action |
|---|---|---|
| < -1.5 | Excellent reversal | Strong candidate for validation |
| -1.5 to -0.8 | Good reversal | Worth investigating |
| -0.8 to -0.3 | Moderate reversal | May have partial effect |
| > -0.3 | Weak/no reversal | Unlikely to reverse signature |
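When screening many drugs, the table above can be applied programmatically. The helper below is a hypothetical convenience function; the table does not say which category a score exactly on a boundary (-1.5, -0.8, or -0.3) belongs to, so assigning it to the weaker category is an assumption.

```python
def interpret_xsum(score):
    """Map an XSum score to the interpretation categories in the table above."""
    if score < -1.5:
        return "Excellent reversal"
    if score < -0.8:
        return "Good reversal"
    if score < -0.3:
        return "Moderate reversal"
    return "Weak/no reversal"

# Hypothetical scores for four drugs
scores = {"drug_a": -2.1, "drug_b": -1.0, "drug_c": -0.5, "drug_d": 0.4}
labels = {drug: interpret_xsum(s) for drug, s in scores.items()}

# List the best candidates (lowest score) first
for drug in sorted(scores, key=scores.get):
    print(f"{drug}\t{scores[drug]:+.2f}\t{labels[drug]}")
```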