Transcriptomics
Overview
What is transcriptomics. The transcriptome of a cell is the complete set of RNA molecules it produces at a given time. Unlike the genome, which is fixed, the transcriptome is dynamic: different cells in the same organism express different subsets of genes, and the same cell changes its expression pattern in response to signals, stress, or developmental cues. Transcriptomics is the large-scale measurement of these expression levels. Its central tool, RNA-seq, converts RNA into cDNA, sequences millions of fragments, and counts how many reads originate from each gene.
Computational challenges. The data volume of a typical RNA-seq experiment is modest (1–10 GB per sample) compared to genome sequencing, but the analysis is layered. Three quantitative questions arise in sequence: (1) How many reads came from gene $g$? (alignment and quantification); (2) Is gene $g$ expressed differently between conditions $A$ and $B$? (differential expression); (3) Do groups of functionally related genes change together? (pathway analysis). The statistical subtlety is that RNA-seq counts are overdispersed, require normalization across samples, and must control for thousands of simultaneous hypotheses.
The single-cell revolution. Classical RNA-seq measures the average expression across millions of cells. But a tissue is heterogeneous: it contains many cell types with very different transcriptomes. Single-cell RNA-seq (scRNA-seq) measures the transcriptome of each cell individually, at the cost of much sparser data. The analysis pipeline involves dimensionality reduction, clustering, and cell type assignment, raising new computational challenges in scalability and statistical inference.
Course
RNA Biology
Definition (RNA, transcriptome). RNA (ribonucleic acid) is a single-stranded polymer over the alphabet $\Sigma_{RNA} = \{\texttt{A}, \texttt{C}, \texttt{G}, \texttt{U}\}$, where uracil (U) replaces thymine (T). RNA is synthesized by RNA polymerase using DNA as a template in a process called transcription. The RNA sequence is identical to the coding (sense) strand of the DNA (with T replaced by U). The transcriptome of a cell at time $t$ is the multiset of all RNA molecules present in the cell at $t$.
Definition (RNA classes). The major classes of RNA in eukaryotes are:
- mRNA (messenger RNA): encodes a protein. It carries information from DNA to the ribosome. In eukaryotes it has a 5' cap, a coding sequence (CDS), and a 3' poly-A tail.
- rRNA (ribosomal RNA): structural and catalytic component of ribosomes. The most abundant RNA by mass ($\sim$80% of total RNA).
- tRNA (transfer RNA): adaptor molecule that delivers amino acids during translation. Each tRNA has an anticodon complementary to one codon.
- ncRNA (non-coding RNA): a broad class including lncRNA (long non-coding RNA, $>$200 nt), miRNA (microRNA, $\sim$22 nt, post-transcriptional repressors), and snRNA (small nuclear RNA, involved in splicing).
Definition (Pre-mRNA, introns, exons, splicing). In eukaryotes, a gene is not a contiguous coding sequence. The primary transcript (pre-mRNA) contains alternating exons (retained in the mature mRNA) and introns (removed). The process of removing introns and joining exons is called splicing, catalyzed by the spliceosome. Let a gene have exons $E_1, \ldots, E_k$ of lengths $\ell_1, \ldots, \ell_k$. The mature mRNA has length $\sum_{i=1}^k \ell_i$, which can be much shorter than the genomic span of the gene (sometimes by a factor of 10–100 in humans, where introns average $\sim$3,400 bp vs. exons of $\sim$170 bp).
Alternative splicing. A single gene can produce multiple distinct mRNA isoforms by including or excluding different exons, or by using alternative splice sites at intron boundaries. A constitutive exon is present in all isoforms; a cassette exon is included in only some. The human genome contains $\sim$20,000 protein-coding genes but $\sim$200,000 distinct transcript isoforms. Alternative splicing is a major source of protein diversity. It complicates RNA-seq quantification because reads spanning a junction must be assigned to the correct isoform, not just to the gene.
complement = str.maketrans('ACGT', 'TGCA')
def transcribe(dna_coding_strand):
return dna_coding_strand.replace('T', 'U')
def splice(pre_mrna, intron_intervals):
# intron_intervals: list of (start, end) 0-based half-open
exons, pos = [], 0
for start, end in sorted(intron_intervals):
exons.append(pre_mrna[pos:start])
pos = end
exons.append(pre_mrna[pos:])
return ''.join(exons)
RNA-seq Technology
Library preparation. RNA-seq converts RNA into sequenceable cDNA in several steps. (1) RNA extraction: total RNA is isolated from cells. (2) RNA selection: since rRNA constitutes $\sim$80% of total RNA, it must be removed. Either poly-A selection (captures mRNAs via oligo-dT beads, discards rRNA and most ncRNA) or rRNA depletion (directly removes rRNA, retains all other RNA) is used. (3) Fragmentation: RNA is fragmented to the target size (typically 200–500 nt). (4) Reverse transcription: RNA fragments are converted to double-stranded cDNA. (5) Adapter ligation and amplification: sequencing adapters are ligated to both ends and the library is PCR-amplified. The result is a set of cDNA fragments, each corresponding to a region of a transcript.
Definition (Paired-end reads, insert size). In paired-end sequencing, both ends of each cDNA fragment are sequenced, producing two reads (called mates) per fragment. If the forward read starts at position $i$ and the reverse read ends at position $j$ on the transcript, the insert size is $j - i + 1$. Paired-end information (1) confirms that both mates align to the same gene or isoform, (2) spans exon-exon junctions when individual reads are too short, and (3) enables estimation of the insert size distribution. In single-end sequencing, only one read is produced per fragment, at lower cost but with less information.
Definition (Stranded library). In a stranded (strand-specific) library, the protocol preserves which DNA strand the RNA was transcribed from. The first-strand cDNA is synthesized with dUTP incorporated into the second strand, which is then degraded before PCR, so only the first strand (same orientation as the original RNA) is amplified. Each read's strand of origin is therefore known. This disambiguates overlapping genes on opposite strands and improves quantification of antisense transcription. In an unstranded library, strand information is lost.
Definition (Sequencing depth for RNA-seq). The sequencing depth $N$ is the total number of reads in a sample. Unlike genome sequencing, optimal depth in RNA-seq depends on the biological question. For standard differential expression of abundant genes: $N \approx 20\text{--}30 \times 10^6$ reads. For detection of lowly expressed genes or rare isoforms: $N \geq 100 \times 10^6$. The critical constraint is not coverage uniformity but count precision: a gene expressed at 0.01% of total mRNA receives only $\sim$100 reads at $N = 10^6$, giving high Poisson noise.
Read Alignment and Quantification
The spliced alignment problem. Aligning RNA-seq reads to a reference genome is harder than genomic alignment because reads can span exon-exon junctions. A 150 bp read may have its first 50 bp from exon $i$ and its last 100 bp from exon $j$, with thousands of intronic bases in between. A standard aligner that does not allow large gaps would fail. Splice-aware aligners (STAR, HISAT2) split reads at candidate junctions and align each piece to its respective exon, with junctions taken from a reference annotation or discovered de novo from the data.
Pseudoalignment. Full splice-aware alignment is expensive: STAR requires $\sim$30 GB of RAM and several hours per sample. An alternative is pseudoalignment (kallisto, salmon). Instead of finding the exact genomic position of each read, pseudoalignment asks: which transcripts are compatible with this read? It builds an index of $k$-mers from all annotated transcripts. For each read, it looks up all $k$-mers in the index and takes the intersection of compatible transcripts. This is $\sim$10–30$\times$ faster and nearly as accurate for quantification, though it does not produce alignment files needed for some downstream analyses (e.g., variant calling, visualization).
from collections import defaultdict
def build_kmer_index(transcripts, k):
# transcripts: dict {transcript_id: sequence}
index = defaultdict(set)
for tid, seq in transcripts.items():
for i in range(len(seq) - k + 1):
index[seq[i:i+k]].add(tid)
return index
def pseudoalign(read, index, k):
compatible = None
for i in range(len(read) - k + 1):
hits = index.get(read[i:i+k])
if hits:
compatible = hits if compatible is None else compatible & hits
return compatible or set()
Definition (Raw counts, count matrix). After alignment, reads are assigned to genes (or transcripts). The raw count $c_{gj}$ is the number of reads in sample $j$ assigned to gene $g$. The count matrix $C \in \mathbb{N}^{G \times J}$ collects all raw counts across $G$ genes and $J$ samples. Raw counts cannot be directly compared between genes (longer genes attract more reads) or between samples (different library sizes produce different total counts). Normalization is required before any comparison.
Definition (RPKM, FPKM, TPM). Let $c_g$ be the raw count of gene $g$, $\ell_g$ its effective length in kilobases, and $N$ the total number of mapped reads in millions. Three normalization units are in common use:
- RPKM (Reads Per Kilobase per Million): $\text{RPKM}_g = c_g / (\ell_g \cdot N)$. Corrects for gene length and library size.
- FPKM (Fragments Per Kilobase per Million): same as RPKM but counts fragments (pairs) rather than individual reads. Used for paired-end data.
- TPM (Transcripts Per Million): $\text{TPM}_g = \frac{c_g / \ell_g}{\sum_{g'} c_{g'} / \ell_{g'}} \times 10^6$. TPM values sum to exactly $10^6$ per sample, making them directly comparable between samples. TPM is now preferred over RPKM/FPKM.
Proposition (TPM sums to $10^6$). For any sample, $\sum_g \text{TPM}_g = 10^6$.
Proof.
Let $r_g = c_g / \ell_g$ and $S = \sum_{g'} r_{g'}$. Then $\text{TPM}_g = (r_g / S) \times 10^6$. Summing over all genes: $\sum_g \text{TPM}_g = \frac{10^6}{S} \sum_g r_g = \frac{10^6}{S} \cdot S = 10^6$.def compute_tpm(counts, lengths_kb):
rates = [c / l for c, l in zip(counts, lengths_kb)]
total = sum(rates)
return [r / total * 1e6 for r in rates]
def compute_rpkm(counts, lengths_kb, total_mapped_millions):
return [c / (l * total_mapped_millions) for c, l in zip(counts, lengths_kb)]
Differential Expression
The differential expression problem. Given two conditions $A$ and $B$ (e.g., treated vs. untreated), each with $n_A$ and $n_B$ biological replicates, we want to identify genes whose expression level differs between conditions. For gene $g$, we test the null hypothesis $H_0^g$: the mean expression of $g$ is equal in $A$ and $B$. The challenge is that RNA-seq counts have a specific statistical structure that must be modeled correctly: they are discrete, non-negative, and exhibit more variance than a Poisson distribution.
Why not Poisson? The number of reads mapping to gene $g$ in a sample of $N$ total reads follows $\text{Binomial}(N, p_g) \approx \text{Poisson}(N p_g)$ for large $N$ and small $p_g$. A Poisson has $\text{Var}(X) = \mathbb{E}[X]$. In practice, biological replicates show variance much larger than the mean: this is called overdispersion. It arises from biological variability between samples -- cell state fluctuations, sample handling, etc. -- that adds extra variance on top of the Poisson sampling noise. A Poisson model underestimates variability and produces far too many false positives.
Definition (Negative binomial model). The negative binomial distribution $\text{NB}(\mu, \alpha)$ with mean $\mu > 0$ and dispersion $\alpha \geq 0$ has variance $\text{Var}(X) = \mu + \alpha \mu^2$. When $\alpha = 0$ it reduces to Poisson. The extra term $\alpha \mu^2$ captures overdispersion: at higher mean expression, absolute variance grows quadratically. In the DESeq2 model, the count $c_{gj}$ for gene $g$ in sample $j$ satisfies $$c_{gj} \sim \text{NB}(\mu_{gj},\, \alpha_g), \qquad \mu_{gj} = s_j \cdot q_{gj}$$ where $s_j$ is a sample-specific size factor (accounting for different library sizes) and $q_{gj}$ is the normalized expression level. The gene-specific dispersion $\alpha_g$ is shared across all samples of the same gene.
Definition (Size factors, median-of-ratios). The size factor $s_j$ of sample $j$ corrects for sequencing depth and RNA composition differences. DESeq2 estimates it as follows. For each gene $g$, compute the geometric mean across samples: $\bar{c}_g = \bigl(\prod_{j=1}^J c_{gj}\bigr)^{1/J}$. The size factor is the median of per-gene ratios: $$s_j = \operatorname{median}_g \left\{\frac{c_{gj}}{\bar{c}_g}\right\}$$ Genes with $\bar{c}_g = 0$ are excluded. Dividing raw counts by $s_j$ gives the normalized counts $\tilde{c}_{gj} = c_{gj} / s_j$, comparable across samples. This estimator is robust to highly expressed outlier genes (which would bias simpler depth normalization).
import math
from statistics import median
def size_factors(count_matrix):
# count_matrix: list of rows, each row = counts for one gene across all samples
n_samples = len(count_matrix[0])
sf = []
for j in range(n_samples):
ratios = []
for row in count_matrix:
# geometric mean over samples for this gene
nonzero = [c for c in row if c > 0]
if len(nonzero) == n_samples:
gm = math.exp(sum(math.log(c) for c in row) / n_samples)
ratios.append(row[j] / gm)
sf.append(median(ratios))
return sf
Dispersion estimation. With only 3–5 replicates, a per-gene estimate of $\alpha_g$ from a single gene's data is very noisy. DESeq2 uses empirical Bayes shrinkage: it first estimates a global trend $\hat{\alpha}({\mu})$ (dispersion as a function of mean expression, pooling information from all genes), then shrinks each gene's maximum-likelihood estimate toward this trend. Highly expressed genes are shrunk less (more data); lowly expressed genes are pulled strongly toward the trend. This stabilizes estimates and improves power for DE testing.
Definition (Log fold-change, Wald test). For gene $g$, the log$_2$ fold-change (LFC) between condition $B$ and condition $A$ is $\text{LFC}_g = \log_2(\hat{q}_{gB} / \hat{q}_{gA})$, where $\hat{q}_{gA}$ and $\hat{q}_{gB}$ are estimated mean normalized expressions. A positive LFC means higher expression in $B$. The Wald test tests $H_0: \text{LFC}_g = 0$ by forming $$W_g = \frac{\widehat{\text{LFC}}_g}{\widehat{\text{SE}}(\widehat{\text{LFC}}_g)}$$ which under $H_0$ is approximately standard Normal. The two-sided $p$-value is $p_g = 2\Phi(-|W_g|)$, where $\Phi$ is the standard Normal CDF.
Definition (Multiple testing, BH correction). Testing $G \approx 20{,}000$ genes simultaneously inflates false positives: even with no true signal, $\sim$1,000 genes would have $p < 0.05$ by chance. The false discovery rate (FDR) is $\text{FDR} = \mathbb{E}[V/R]$, where $V$ is the number of false rejections and $R$ the total number of rejections. The Benjamini--Hochberg (BH) procedure controls FDR at level $q$: sort $p$-values $p_{(1)} \leq \cdots \leq p_{(G)}$; let $k^* = \max\{k : p_{(k)} \leq kq/G\}$; reject all hypotheses $1, \ldots, k^*$. The adjusted $p$-value of gene $(k)$ is $\tilde{p}_{(k)} = \min_{j \geq k}(G \cdot p_{(j)} / j)$.
Theorem (BH controls FDR). If the $p$-values of the true null hypotheses are independent and $\text{Uniform}[0,1]$, the BH procedure at level $q$ satisfies $\text{FDR} \leq q \cdot G_0 / G \leq q$, where $G_0$ is the number of true null hypotheses.
def bh_correction(pvalues, q=0.05):
n = len(pvalues)
indexed = sorted(enumerate(pvalues), key=lambda x: x[1])
adjusted = [0.0] * n
min_so_far = 1.0
for rank, (i, p) in reversed(list(enumerate(indexed, 1))):
adj = min(p * n / rank, 1.0)
min_so_far = min(min_so_far, adj)
adjusted[i] = min_so_far
return adjusted
Transcriptome Assembly
Genome-guided vs. de novo assembly. Transcriptome assembly reconstructs the set of expressed transcript sequences from RNA-seq reads. Two strategies exist. Genome-guided assembly (StringTie, Cufflinks) first aligns reads to a reference genome, then reconstructs transcripts from the alignment. It is the standard when a good reference is available. De novo assembly (Trinity) assembles transcripts without any reference, using de Bruijn graphs built from RNA-seq $k$-mers. It is necessary for organisms lacking a reference genome, but is more computationally intensive and more prone to assembly errors.
Definition (Splice graph). Given splice-aligned reads on a genomic locus, the splice graph $SG = (V, E)$ is a directed acyclic graph where each node $v \in V$ represents a maximal contiguous genomic interval covered by reads without interruption (an exonic segment), and each directed edge $(u, v) \in E$ represents an exon-exon junction observed in at least one spliced read. Each path from a source node to a sink node in $SG$ corresponds to a candidate transcript isoform. The assembly problem reduces to: which paths through $SG$ best explain the observed read coverage?
Isoform abundance estimation by EM. Given candidate isoforms $T_1, \ldots, T_K$ (paths in the splice graph), estimate their abundances $\theta_k \geq 0$ with $\sum_k \theta_k = 1$. Each read $r$ is compatible with a subset $S_r \subseteq [K]$ of isoforms. Assuming reads are generated independently, with read $r$ arising from isoform $k$ with probability proportional to $\theta_k$, the log-likelihood is $$\ell(\boldsymbol{\theta}) = \sum_r \log \sum_{k \in S_r} \theta_k$$ This is maximized by the EM algorithm: the E-step computes fractional assignments $\hat{n}_{rk} = \theta_k / \sum_{k' \in S_r} \theta_{k'}$; the M-step updates $\theta_k \propto \sum_r \hat{n}_{rk}$. Iterating to convergence yields maximum-likelihood isoform abundances.
def em_isoform(compatible_sets, n_isoforms, n_iter=100):
# compatible_sets[r] = set of isoform indices compatible with read r
theta = [1.0 / n_isoforms] * n_isoforms
for _ in range(n_iter):
counts = [0.0] * n_isoforms
for S in compatible_sets:
denom = sum(theta[k] for k in S)
for k in S:
counts[k] += theta[k] / denom
total = sum(counts)
theta = [c / total for c in counts]
return theta
Single-Cell RNA-seq
Droplet-based scRNA-seq. The dominant scRNA-seq technology (10x Genomics Chromium) encapsulates individual cells in nanoliter-scale aqueous droplets, each containing a barcoded bead. The bead releases oligonucleotides with (1) a cell barcode (CBC) that uniquely identifies the droplet, and (2) a unique molecular identifier (UMI), a short random sequence that tags each individual mRNA molecule. All mRNAs in the cell are captured by oligo-dT on the bead, reverse-transcribed, amplified, and sequenced. Because each mRNA gets a distinct UMI, PCR duplicates are removed by collapsing reads with the same CBC and UMI, yielding a count proportional to original molecule count.
Definition (UMI count matrix). After demultiplexing by cell barcode and UMI deduplication, the output is a UMI count matrix $X \in \mathbb{N}^{G \times C}$ where $X_{gc}$ is the number of distinct UMIs observed for gene $g$ in cell $c$. The matrix is extremely sparse: a typical cell detects $\sim$2,000--5,000 genes out of $\sim$30,000, so $\sim$85--93\% of entries are zero. This sparsity has two causes: genuine absence of expression, and technical dropout (a transcript was present but not captured).
Quality control. Not every droplet contains a valid cell. Key QC metrics per barcode: (1) total UMI count: very low values indicate empty droplets or dead cells; (2) number of genes detected: very low suggests empty droplets, very high suggests doublets (two cells co-encapsulated); (3) fraction of mitochondrial UMIs: high values ($>$20--25\%) indicate damaged cells that have lost cytoplasmic RNA but retained mitochondria. Cells failing any threshold are removed before downstream analysis.
Definition (Log-normalization). Each cell's raw counts are divided by its library size $N_c = \sum_g X_{gc}$ and scaled by a factor $S = 10{,}000$ (CP10K -- counts per 10,000): $\tilde{X}_{gc} = (X_{gc} / N_c) \times S$. A log-transformation is applied: $Y_{gc} = \log_2(\tilde{X}_{gc} + 1)$. The $+1$ pseudocount prevents $\log(0)$ and stabilizes variance for low counts. The resulting matrix $Y \in \mathbb{R}^{G \times C}$ is the starting point for all downstream analyses.
Dimensionality reduction: PCA then UMAP. The matrix $Y$ has $G \approx 30{,}000$ dimensions and $C \approx 10{,}000$ cells. Working directly in $\mathbb{R}^G$ is infeasible (curse of dimensionality). The standard pipeline: (1) select the top $D \approx 2{,}000$ highly variable genes (HVGs) -- those with the most biological variation across cells; (2) run PCA on the $D \times C$ submatrix to get a $k$-dimensional embedding (typically $k = 50$); (3) apply UMAP to the $k$-dimensional PCA embedding for 2D visualization. PCA captures linear structure; UMAP preserves local neighborhoods in a nonlinear embedding. Clustering is done in PCA space, not UMAP space (UMAP distorts global distances).
import math
from statistics import mean, variance
def log_normalize(count_matrix, scale=1e4):
# count_matrix[c][g] = UMI count for cell c, gene g
result = []
for row in count_matrix:
total = sum(row)
result.append([math.log2(x / total * scale + 1) for x in row])
return result
def highly_variable_genes(log_norm, top_n=2000):
n_cells = len(log_norm)
n_genes = len(log_norm[0])
col = [[log_norm[c][g] for c in range(n_cells)] for g in range(n_genes)]
var_per_gene = [variance(col[g]) if n_cells > 1 else 0 for g in range(n_genes)]
ranked = sorted(range(n_genes), key=lambda g: -var_per_gene[g])
return set(ranked[:top_n])
Definition (kNN graph and clustering). After PCA, each cell is a point in $\mathbb{R}^k$. A $k$-nearest-neighbor graph (kNN graph) is constructed: each cell is connected to its $k$ nearest neighbors in PCA space (typically $k = 15$, Euclidean distance). Community detection on this graph identifies clusters of cells with similar transcriptional profiles. The Leiden algorithm optimizes a modularity-like quality function over graph partitions. A resolution parameter $\gamma > 0$ controls granularity: larger $\gamma$ yields more, smaller clusters.
Cell type annotation. Clusters have no biological label by default. Annotation is done by finding marker genes -- genes specifically expressed in one cluster and not others. Computationally, marker genes are identified by differential expression between each cluster and all others (one-vs-rest test, typically using a Wilcoxon rank-sum test or logistic regression). The top markers are then matched to known cell type signatures in databases such as CellMarker or PanglaoDB, or automatically assigned using reference-based tools such as SingleR.
Gene Networks and Pathways
Definition (Gene Ontology). The Gene Ontology (GO) is a controlled vocabulary describing gene functions, structured as a directed acyclic graph. Each GO term is a node representing a biological concept; edges represent "is-a" or "part-of" relationships. GO has three independent ontologies: Biological Process (BP, e.g., "DNA repair"), Molecular Function (MF, e.g., "ATP binding"), and Cellular Component (CC, e.g., "nucleus"). The true path rule: if gene $g$ is annotated with term $t$, it is implicitly annotated with all ancestors of $t$ in the DAG.
Definition (Over-representation analysis, hypergeometric test). Given a hit list $S$ (e.g., differentially expressed genes) of size $n$ drawn from a universe of $N$ genes, and a GO term $T$ annotating $K$ genes in the universe, the overlap $k = |S \cap T|$ follows a hypergeometric distribution under the null that $S$ is a random subset: $$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$ The one-sided $p$-value for over-representation is $P(X \geq k) = \sum_{x=k}^{\min(K,n)} P(X = x)$. After BH correction over all tested terms, enriched terms with small adjusted $p$-value represent biological processes disproportionately represented in the hit list.
from math import comb
def hypergeometric_pvalue(N, K, n, k):
# P(X >= k), X ~ Hypergeometric(N, K, n)
return sum(
comb(K, x) * comb(N - K, n - x) / comb(N, n)
for x in range(k, min(K, n) + 1)
)
def over_representation(hit_set, gene_sets, universe):
N = len(universe)
n = len(hit_set & universe)
results = []
for term, genes in gene_sets.items():
K = len(genes & universe)
k = len(hit_set & genes & universe)
if K > 0 and k > 0:
p = hypergeometric_pvalue(N, K, n, k)
results.append((term, k, K, p))
return sorted(results, key=lambda x: x[3])
Definition (Gene Set Enrichment Analysis, GSEA). ORA requires a hard threshold to define the hit list. GSEA avoids this: it takes a ranked list of all $G$ genes (ranked by signed LFC or $-\log_{10}(p)$) and a gene set $T$. Walking down the ranked list, an enrichment score ES is computed as a running sum that increments by $1/|T|$ when a gene in $T$ is encountered and decrements by $1/(G - |T|)$ otherwise. ES($T$) is the maximum deviation from zero. A large positive ES means genes in $T$ cluster at the top of the list (preferentially up-regulated). Significance is assessed by permutation of gene labels.
def gsea_score(ranked_genes, gene_set):
n = len(ranked_genes)
n_set = sum(1 for g in ranked_genes if g in gene_set)
n_miss = n - n_set
if n_set == 0 or n_miss == 0:
return 0.0
running, max_dev = 0.0, 0.0
for g in ranked_genes:
running += 1.0 / n_set if g in gene_set else -1.0 / n_miss
if abs(running) > abs(max_dev):
max_dev = running
return max_dev
Co-expression networks. A gene co-expression network is an undirected weighted graph where nodes are genes and edge weights measure similarity of expression profiles across samples. The Pearson correlation $\rho_{gg'}$ between the expression vectors of genes $g$ and $g'$ (over all samples) is the most common similarity. WGCNA (Weighted Gene Co-expression Network Analysis) raises correlations to a power $\beta$ chosen to enforce scale-free network topology: $w_{gg'} = |\rho_{gg'}|^\beta$. Hierarchical clustering of the resulting adjacency matrix partitions genes into modules -- groups of co-expressed genes. Each module is summarized by its eigengene (first principal component), which can be correlated with external traits to find biologically relevant modules.