Proteomics

Overview

The proteome. While the genome is a static blueprint, proteins are the cell's active machinery: they catalyze reactions, transmit signals, form structures, and regulate gene expression. The proteome is the complete set of proteins expressed by a cell at a given time. Unlike the genome, it is dynamic -- it changes with the cell's state, environment, and developmental stage. A human cell expresses roughly 10,000-15,000 distinct proteins at any moment, from ~20,000 protein-coding genes.

Three computational pillars. Proteomics organizes around three major computational problems. Sequence search: given a query protein, find similar sequences in large databases to infer function or evolutionary relationships. Mass spectrometry analysis: identify which proteins are present in a biological sample by matching peptide fragment spectra to a database of theoretical spectra. Structure prediction: given a sequence, predict the 3D arrangement of atoms -- recently transformed by deep learning (AlphaFold2).

The interactome. Proteins rarely act alone. They form transient and stable complexes, phosphorylate each other, compete for binding partners, and organize into functional modules. The interactome is the network of all protein-protein interactions in a cell, modeled as a graph where vertices are proteins and edges are interactions. Analyzing this network reveals essential proteins (hubs), how perturbations propagate, and how biological functions are modularized.

Harder than genomics. Proteomics is computationally harder than genomics for several reasons. Proteins are defined over 20 residues (vs. 4 nucleotides), making the sequence space far larger. Post-translational modifications (phosphorylation, glycosylation, ubiquitination, and hundreds of others) create enormous chemical diversity beyond what the sequence alone encodes. Structure prediction requires reasoning over continuous 3D space. And unlike the fixed genome, protein abundance must be inferred indirectly through mass spectrometry.

Course

Proteins and the Genetic Code

What proteins do. Proteins are the workhorses of the cell. A few concrete examples: insulin (51 residues) signals cells to take up glucose; hemoglobin (~574 residues, 4 subunits) carries oxygen in blood; DNA polymerase replicates the genome at ~1000 bases per second; collagen provides mechanical strength to tendons and skin; p53 detects DNA damage and triggers cell death. Each of these proteins does something chemically and physically specific -- and that specificity comes entirely from its 3D shape, which in turn comes entirely from its sequence of amino acids. This is the central claim of molecular biology: sequence determines structure determines function.

Definition (Amino acid). An amino acid is a molecule with an amino group ($-\text{NH}_2$), a carboxyl group ($-\text{COOH}$), and a variable side chain (R group) attached to a central $\alpha$-carbon. There are 20 canonical amino acids, grouped by side chain chemistry: nonpolar/aliphatic (Gly, Ala, Val, Leu, Ile, Pro, Met), aromatic (Phe, Trp, Tyr), polar uncharged (Ser, Thr, Cys, Asn, Gln), positively charged (Lys, Arg, His), negatively charged (Asp, Glu). Each has a three-letter code (e.g., Ala) and a one-letter code (e.g., A).

Definition (Polypeptide, protein). A peptide bond is a covalent bond formed between the carboxyl group of one amino acid and the amino group of the next, with loss of a water molecule. A polypeptide is a linear chain of amino acids joined by peptide bonds. A protein is a polypeptide (or assembly of polypeptides) folded into a specific 3D conformation to carry out a biological function. The primary structure is the amino acid sequence, written from N-terminus (free amino group) to C-terminus (free carboxyl group).

Definition (Genetic code). The genetic code maps each of the 64 possible codons (RNA triplets) to one of 20 amino acids or a stop signal. Three codons are stop codons (UAA, UAG, UGA). The code is degenerate: most amino acids are encoded by 2-6 synonymous codons. Degeneracy is concentrated at the third codon position (wobble position): two codons differing only at position 3 often encode the same amino acid, because a tRNA anticodon can tolerate non-Watson-Crick pairing at that position. The code is nearly universal across all life.

Definition (Open reading frame). An open reading frame (ORF) is a contiguous sequence of codons beginning with a start codon (AUG, encoding Met) and ending with a stop codon. In a double-stranded DNA sequence of length $n$, there are six reading frames: three on each strand, shifted by 0, 1, or 2 nucleotides. ORF prediction uses codon usage bias, ORF length, and conservation to distinguish true protein-coding genes from random ORFs, which occur frequently by chance in long sequences.

Python implementation. Codon table and translation of a DNA coding sequence to a protein sequence.

CODON_TABLE = {
    'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L',
    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',
    'ATT': 'I', 'ATC': 'I', 'ATA': 'I', 'ATG': 'M',
    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
    'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',
    'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
    'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
    'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    'TAT': 'Y', 'TAC': 'Y', 'TAA': '*', 'TAG': '*',
    'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'AAT': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
    'GAT': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
    'TGT': 'C', 'TGC': 'C', 'TGA': '*', 'TGG': 'W',
    'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
    'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
    'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
}

def translate(dna):
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i+3], '?')
        if aa == '*':
            break
        protein.append(aa)
    return ''.join(protein)

def find_orfs(dna, min_len=100):
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i+3] == 'ATG':
                for j in range(i + 3, len(dna) - 2, 3):
                    if CODON_TABLE.get(dna[j:j+3]) == '*':
                        if j - i >= min_len:
                            orfs.append((i, j + 3, frame))
                        break
            i += 3
    return orfs

Sequence Search and Databases

Protein databases. UniProt is the central repository for protein sequences and annotation. It divides into Swiss-Prot (manually reviewed, ~570,000 entries with curated functional annotation) and TrEMBL (automatically annotated, ~250 million entries). The Protein Data Bank (PDB) holds experimentally determined 3D structures. The FASTA format represents a sequence as a header line beginning with > followed by lines of single-letter residue codes. Identifying similar sequences for an unknown protein is the first step toward inferring its function.

Sequence similarity and function. Why compare sequences at all? Because evolution is conservative: proteins that descended from a common ancestor tend to retain similar sequences and similar functions. If we discover a new protein in an unstudied organism and find that it shares 60% amino acid identity with a human enzyme of known function, we have strong evidence it performs the same reaction. As a rule of thumb: above ~30% identity over the full length, two proteins almost certainly share the same fold; below ~20%, similarity is unreliable without additional evidence. The core computational task is therefore: given a query sequence, find all sequences in a large database that are significantly similar to it.

Scoring substitutions: BLOSUM62. A simple match/mismatch score works for DNA (4-letter alphabet, similar letters). For proteins (20-letter alphabet), some substitutions are much more tolerable than others: Val and Leu are both hydrophobic and nonpolar, so V→L is common in evolution; Asp (negatively charged) and Arg (positively charged) are chemically opposite, so D→R is rare. The BLOSUM62 matrix encodes these frequencies: entry $s(a,b) = \log_2(f_{ab} / (p_a p_b))$ is the log-odds ratio of observing amino acid pair $(a,b)$ in aligned, conserved protein blocks relative to chance. Positive scores mean the substitution is favored by evolution; negative scores mean it is disfavored.

Definition (Local alignment -- Smith-Waterman). Given two sequences $a = a_1 \cdots a_m$ and $b = b_1 \cdots b_n$ over an amino acid alphabet, a local alignment pairs a contiguous substring of $a$ with a contiguous substring of $b$ to maximize a score. The Smith-Waterman algorithm finds the optimal local alignment in $O(mn)$ time via the recurrence: for $1 \leq i \leq m$, $1 \leq j \leq n$, $$H(i,j) = \max\bigl(0,\; H(i-1,j-1)+s(a_i,b_j),\; H(i-1,j)-g,\; H(i,j-1)-g\bigr)$$ with $H(i,0) = H(0,j) = 0$, where $s(a_i,b_j)$ is the substitution score (from BLOSUM62 for proteins) and $g > 0$ is the linear gap penalty. The optimal score is $\max_{i,j} H(i,j)$, and the alignment is recovered by traceback from that cell.

Python implementation. Smith-Waterman local alignment with a simple match/mismatch scoring scheme.

def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i-1] == b[j-1] else mismatch
            H[i][j] = max(0, H[i-1][j-1] + s, H[i-1][j] - gap, H[i][j-1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # traceback from best cell
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while H[i][j] > 0:
        s = match if a[i-1] == b[j-1] else mismatch
        if H[i][j] == H[i-1][j-1] + s:
            aligned_a.append(a[i-1]); aligned_b.append(b[j-1]); i -= 1; j -= 1
        elif H[i][j] == H[i-1][j] - gap:
            aligned_a.append(a[i-1]); aligned_b.append('-'); i -= 1
        else:
            aligned_a.append('-'); aligned_b.append(b[j-1]); j -= 1
    return best, ''.join(reversed(aligned_a)), ''.join(reversed(aligned_b))

Definition (BLAST). BLAST (Basic Local Alignment Search Tool) is a heuristic for fast database search that avoids full Smith-Waterman against every sequence. For protein search (BLASTP): (1) Seed: enumerate all words of length $w = 3$ in the query; for each query word, find all words in the amino acid alphabet scoring $\geq T$ against it under BLOSUM62 -- these are the neighborhood words. (2) Scan: scan the database for exact matches to neighborhood words. (3) Extend: extend each hit in both directions (without gaps first, then with gaps); report alignments of score $\geq S$.

Definition (E-value). The E-value of an alignment with score $S$ is the expected number of alignments of equal or better score that would be found by chance in a database of the same size. Under the Karlin-Altschul model: $$E = K \cdot m \cdot n \cdot e^{-\lambda S}$$ where $m$ is the query length, $n$ is the total database size in residues, and $K$, $\lambda$ are parameters determined by the scoring matrix and gap costs. A small E-value (e.g. $E < 10^{-5}$) indicates a statistically significant hit unlikely to arise by chance.

Mass Spectrometry

LC-MS/MS workflow. The standard pipeline for protein identification is: (1) Digestion: the protein mixture is digested with trypsin, which cleaves on the C-terminal side of Lys (K) and Arg (R), not before Pro. This produces peptides of 6-30 residues with predictable ends. (2) LC separation: peptides are separated by reverse-phase liquid chromatography to reduce mixture complexity. (3) MS1: the mass spectrometer measures the $m/z$ of intact peptide ions. (4) MS2: selected precursor ions are fragmented (by HCD or CID), producing fragment ion spectra. (5) Database search: fragment spectra are matched against theoretical spectra from a protein database to identify the peptide.

Definition (Peptide mass). The monoisotopic mass of a peptide of $n$ residues $r_1, \ldots, r_n$ is: $$M = \sum_{i=1}^{n} m_{r_i} + 18.010565 \text{ Da}$$ where $m_{r_i}$ is the monoisotopic residue mass of amino acid $r_i$ and $18.010565$ Da accounts for the water molecule completing the termini. The instrument measures the $m/z$ ratio of the protonated ion: $m/z = (M + z \cdot 1.007276) / z$, where $z$ is the charge state.

Definition (b-ions and y-ions). Fragmentation of a peptide $r_1 r_2 \cdots r_n$ at the peptide bond between positions $k$ and $k+1$ produces two ion series. b-ions retain the N-terminus: $$b_k = \sum_{i=1}^{k} m_{r_i} + 1.007276 \text{ Da}$$ y-ions retain the C-terminus (indexing from the C-terminal end): $$y_k = \sum_{i=n-k+1}^{n} m_{r_i} + 18.010565 + 1.007276 \text{ Da}$$ Together, the $b_1, \ldots, b_{n-1}$ and $y_1, \ldots, y_{n-1}$ series form a mass ladder that can be read to reconstruct the peptide sequence.

Definition (Peptide-spectrum match and FDR). A peptide-spectrum match (PSM) is a candidate assignment of a peptide sequence to an observed MS2 spectrum. A score quantifies agreement between the theoretical and observed spectra; a simple choice is the number of matching fragment ion masses within a tolerance $\delta$ (typically $\delta = 0.02$ Da). To estimate the false discovery rate (FDR), the target-decoy approach concatenates the real database with a decoy database of reversed sequences. At score threshold $\tau$: $$\widehat{\text{FDR}}(\tau) = \frac{\#\text{decoy PSMs} \geq \tau}{\#\text{target PSMs} \geq \tau}$$ The threshold is tuned so that $\widehat{\text{FDR}} \leq 0.01$ (1% FDR is standard).

Python implementation. Theoretical spectrum (b- and y-ion masses) and a simple fragment ion matching score.

RESIDUE_MASS = {
    'A': 71.03711,  'R': 156.10111, 'N': 114.04293, 'D': 115.02694,
    'C': 103.00919, 'E': 129.04259, 'Q': 128.05858, 'G':  57.02146,
    'H': 137.05891, 'I': 113.08406, 'L': 113.08406, 'K': 128.09496,
    'M': 131.04049, 'F': 147.06841, 'P':  97.05276, 'S':  87.03203,
    'T': 101.04768, 'W': 186.07931, 'Y': 163.06333, 'V':  99.06841,
}
PROTON = 1.007276
WATER  = 18.010565

def theoretical_spectrum(peptide):
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:k]) + PROTON       for k in range(1, n)]
    y_ions = [sum(masses[k:]) + WATER + PROTON for k in range(1, n)]
    return b_ions, y_ions

def fragment_score(b_ions, y_ions, observed_masses, tol=0.02):
    obs = set(round(m / tol) for m in observed_masses)
    return sum(1 for m in b_ions + y_ions if round(m / tol) in obs)

Protein Structure Prediction

Definition (Structural hierarchy). Protein structure is described at four levels. Primary structure: the amino acid sequence. Secondary structure: local regular structures stabilized by backbone hydrogen bonds -- chiefly $\alpha$-helices and $\beta$-sheets. Tertiary structure: the complete 3D fold of a single polypeptide, stabilized by hydrophobic packing, disulfide bonds, salt bridges, and hydrogen bonds. Quaternary structure: the arrangement of multiple polypeptide chains (subunits) into a functional complex.

Alpha-helix and beta-sheet. An $\alpha$-helix is a right-handed spiral with 3.6 residues per turn and a 5.4 Å pitch. Each backbone NH forms a hydrogen bond with the CO of the residue four positions earlier (i to i+4 pattern). A $\beta$-sheet consists of two or more $\beta$-strands running side by side, connected by backbone H-bonds. Strands can be parallel (same N-to-C direction) or antiparallel (opposite directions). Antiparallel $\beta$-sheets form stronger, more linear H-bonds and are more common. Pro cannot be part of a regular $\alpha$-helix (it lacks the backbone NH); Gly can adopt conformations forbidden to other residues.

Definition (Ramachandran plot). The backbone conformation of each residue is described by two dihedral angles: $\phi$ (rotation around the N-$C_\alpha$ bond) and $\psi$ (rotation around the $C_\alpha$-C bond), both in $(-180°, 180°]$. The Ramachandran plot is a scatter plot of $(\phi, \psi)$ values across all residues of a structure. Most residues cluster in sterically allowed regions: $\alpha$-helices ($\phi \approx -57°$, $\psi \approx -47°$) and $\beta$-sheets ($\phi \approx -120°$, $\psi \approx +130°$). Residues in disallowed regions indicate errors in the model.

Levinthal's paradox. If a protein of 100 residues sampled all backbone conformations at $10^{13}$ per second, with only 3 conformations per residue, the $3^{100} \approx 5 \times 10^{47}$ states would take longer than the age of the universe to explore. Yet proteins fold in microseconds to seconds. Folding is not a random search: it is guided by a funnel-shaped energy landscape that channels the chain toward the native state. Structure prediction seeks to find this minimum-energy conformation directly, without simulating the folding pathway.

Definition (Homology modeling). Homology modeling builds a 3D model of a target protein by aligning it onto a template protein of known structure. It is justified by the observation that proteins sharing more than ~30% sequence identity over their full length nearly always adopt the same overall fold. The pipeline: (1) search PDB for homologous templates using BLAST or PSI-BLAST; (2) align target and template sequences; (3) copy backbone coordinates from the template; (4) model loop regions and side chains; (5) energy minimize. Below 30% identity, fold recognition (threading) methods are needed.

AlphaFold2. AlphaFold2 (DeepMind, 2021) achieves near-experimental accuracy for most protein families without requiring a template. Its architecture: (1) Evoformer: takes as input a multiple sequence alignment (MSA) of the target with evolutionary homologs, and a residue-pair representation. Alternating MSA row/column attention and pair update blocks refine both representations, learning evolutionary covariation as a proxy for spatial contacts. (2) Structure module: assigns rigid-body frames to each residue via invariant point attention (IPA), producing 3D coordinates. (3) Confidence: outputs a per-residue pLDDT score (0-100) and a predicted TM-score. pLDDT > 90 indicates very high confidence; 70-90 high; 50-70 low; < 50 very low (often disordered). AlphaFold2 predictions for the entire human proteome are freely available in the AlphaFold Database.

Python implementation. Compute a $C_\alpha$ contact map from 3D coordinates.

import math

def contact_map(ca_coords, threshold=8.0):
    n = len(ca_coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 4, n):  # skip sequence-local pairs
            d = math.sqrt(sum((ca_coords[i][k] - ca_coords[j][k])**2 for k in range(3)))
            if d <= threshold:
                contacts[i][j] = contacts[j][i] = 1
    return contacts

def hydrophobic_energy(contacts, sequence, hydrophobic='VILMFYW'):
    energy = 0.0
    n = len(sequence)
    for i in range(n):
        for j in range(i + 4, n):
            if contacts[i][j] and sequence[i] in hydrophobic and sequence[j] in hydrophobic:
                energy -= 1.0
    return energy

Structural Analysis

Why compare structures? Sequence comparison has a blind spot: two proteins can share the same 3D fold and function while having less than 20% sequence identity -- too low for BLAST to detect. This is convergent evolution (or very deep homology erased by sequence drift). Structure, being more conserved than sequence across evolution, reveals these relationships. Three practical questions drive structural comparison: (1) Does a newly solved structure match any known fold in the PDB? (2) How accurately does a predicted model (e.g. from AlphaFold) reproduce the experimental structure? (3) How much does a protein's structure change between apo and ligand-bound forms, or between species? All three reduce to the same computational problem: given two sets of 3D atomic coordinates, quantify how similar they are.

Definition (RMSD). Given two sets of $n$ corresponding atom positions $r_1, \ldots, r_n$ and $r_1', \ldots, r_n'$ in $\mathbb{R}^3$, the root mean square deviation is: $$\text{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \|r_i - r_i'\|^2}$$ RMSD measures the average distance between corresponding atoms after optimal superposition. It is typically computed over $C_\alpha$ atoms only. RMSD is sensitive to a few large deviations (flexible loops or termini) and depends on the choice of rotation and translation; the Kabsch algorithm finds the rotation that minimizes it.

Theorem (Kabsch algorithm). Let $P, Q \in \mathbb{R}^{n \times 3}$ be centered (zero-mean) coordinate matrices. The rotation $R \in SO(3)$ minimizing $\|PR - Q\|_F^2$ is given by the SVD of the cross-covariance matrix $H = P^T Q$. Writing $H = U \Sigma V^T$: $$R = V \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \det(VU^T) \end{pmatrix} U^T$$ The sign correction $\det(VU^T)$ ensures $\det(R) = +1$, i.e. $R \in SO(3)$ rather than an improper rotation (reflection).

Definition (TM-score). The TM-score is a length-normalized structural similarity measure less sensitive to local errors than RMSD: $$\text{TM-score} = \frac{1}{L_{\text{ref}}} \sum_{i=1}^{L_{\text{ali}}} \frac{1}{1 + (d_i / d_0)^2}$$ where $L_{\text{ref}}$ is the length of the reference protein, $L_{\text{ali}}$ is the number of aligned residue pairs, $d_i$ is the distance between the $i$-th aligned pair after superposition, and $d_0 = 1.24 (L_{\text{ref}} - 15)^{1/3} - 1.8$ Å normalizes by protein size. TM-score lies in $(0, 1]$. A TM-score $> 0.5$ generally indicates the same fold; $< 0.17$ is comparable to a random pair.

Python implementation. Basic RMSD and TM-score computation from $C_\alpha$ coordinates.

import math

def rmsd(coords1, coords2):
    n = len(coords1)
    total = sum(
        (coords1[i][0] - coords2[i][0])**2 +
        (coords1[i][1] - coords2[i][1])**2 +
        (coords1[i][2] - coords2[i][2])**2
        for i in range(n)
    )
    return math.sqrt(total / n)

def tm_score(aligned_distances, L_ref):
    d0 = max(1.24 * (L_ref - 15) ** (1/3) - 1.8, 0.5)
    return sum(1.0 / (1 + (d / d0)**2) for d in aligned_distances) / L_ref

Protein-Protein Interactions

Definition (PPI network). A protein-protein interaction network (interactome) is a graph $G = (V, E)$ where each vertex $v \in V$ is a protein and each edge $\{u,v\} \in E$ represents a detected or predicted physical interaction. Edges may carry a confidence weight $w : E \to [0,1]$. The degree $\deg(v)$ is the number of interaction partners of $v$. A hub protein has anomalously high degree. PPI networks are approximately scale-free: the degree distribution follows a power law $P(\deg = k) \propto k^{-\gamma}$, with a small number of highly connected hubs and many low-degree proteins.

Definition (Centrality measures). The degree centrality of $v$ in a network of $n$ proteins is $C_d(v) = \deg(v)/(n-1)$. The betweenness centrality is: $$C_b(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$ where $\sigma_{st}$ is the total number of shortest paths from $s$ to $t$ and $\sigma_{st}(v)$ is the number passing through $v$. High betweenness identifies bridge proteins between functional modules. The clustering coefficient of $v$ is: $$C(v) = \frac{2 \,|\{\{i,j\} \in E : i,j \in N(v)\}|}{\deg(v)\,(\deg(v)-1)}$$ measuring how interconnected $v$'s neighbors are.

Experimental methods and databases. PPIs are detected by several techniques. Yeast two-hybrid (Y2H): two proteins are fused to split halves of a transcription factor; interaction reconstitutes a reporter gene. Sensitive but prone to false positives. Co-immunoprecipitation (co-IP): pull down one protein with an antibody and detect interacting partners by mass spectrometry. Crosslinking MS (XL-MS): chemical crosslinkers covalently link proximal residues; crosslinked peptides identified by MS provide distance constraints. STRING integrates experimental, text-mining, and genomic-context evidence for >5000 organisms, assigning confidence scores per edge. Hub proteins tend to be essential for viability, since removing a hub disconnects many functional pathways.

Python implementation. Build a PPI graph and compute degree centrality and local clustering coefficients.

from collections import defaultdict

def build_ppi(interactions):
    graph = defaultdict(set)
    for u, v in interactions:
        graph[u].add(v)
        graph[v].add(u)
    return graph

def degree_centrality(graph):
    n = len(graph)
    return {v: len(nb) / (n - 1) for v, nb in graph.items()}

def clustering_coefficient(graph, v):
    nb = graph[v]
    k = len(nb)
    if k < 2:
        return 0.0
    edges = sum(1 for u in nb for w in nb if u < w and w in graph[u])
    return 2 * edges / (k * (k - 1))

def connected_components(graph):
    visited, components = set(), []
    for start in graph:
        if start not in visited:
            comp, stack = [], [start]
            while stack:
                v = stack.pop()
                if v not in visited:
                    visited.add(v); comp.append(v)
                    stack.extend(graph[v] - visited)
            components.append(comp)
    return components

Functional Annotation

From sequence to function. Given a newly sequenced protein with no experimental characterization, what can we infer about its function? The simplest approach is BLAST: find the closest relative in a database of annotated proteins and transfer its annotation. This works well when close homologs exist (e.g. >50% identity). But it breaks down in two common situations: (1) the protein is genuinely novel, with no close relative in any database; (2) it belongs to a large family where the conserved functional region is only a small domain embedded in a much larger, unrelated sequence. In case (2), global alignment misses the local signal -- we need tools that detect conserved modules rather than full-length similarity.

Definition (Protein domain and Pfam). A protein domain is a conserved sequence and structural unit that folds independently and carries a specific function. Most proteins are modular: they consist of one or more domains connected by flexible linkers. Pfam is a database of protein families and domains, each represented by a profile hidden Markov model (profile HMM) built from a curated multiple sequence alignment. Scanning a query sequence against Pfam identifies which known domains it contains, providing functional hypotheses without requiring experimental characterization.

Definition (Profile HMM). A profile HMM of length $L$ has three states per column: match ($M_k$), insert ($I_k$), and delete ($D_k$) for $k = 1, \ldots, L$. Match states have emission distributions over the 20 amino acids encoding the conservation at that alignment column; insert states model insertions; delete states are silent. Transition probabilities connect consecutive states. The probability of a sequence $x = x_1 \cdots x_n$ is computed by the forward algorithm in $O(nL)$ time; the most likely alignment is found by Viterbi decoding in the same complexity.

PSI-BLAST and position-specific scoring. PSI-BLAST (Position-Specific Iterated BLAST) iteratively refines a database search to detect distant homologs. After the first BLAST run, a position-specific scoring matrix (PSSM) is built from the alignment of all significant hits. At each position $i$ and amino acid $a$: $$\text{PSSM}(i, a) = \log_2 \frac{f_{ia}}{q_a}$$ where $f_{ia}$ is the (pseudo-count-smoothed) frequency of $a$ at position $i$ in the alignment, and $q_a$ is its background frequency. The PSSM replaces BLOSUM62 in the next BLAST iteration. After 3-5 iterations PSI-BLAST detects homologs invisible to standard BLAST.

Definition (Gene Ontology). The Gene Ontology (GO) is a controlled vocabulary for annotating gene products across three independent ontologies: Biological Process (BP): the biological program in which the protein participates (e.g., "DNA repair", "glycolysis"); Molecular Function (MF): the biochemical activity at the molecular level (e.g., "kinase activity", "DNA binding"); Cellular Component (CC): where the protein exerts its function (e.g., "nucleus", "ribosome"). GO terms form a directed acyclic graph: if a protein is annotated with a term, it is implicitly annotated with all its ancestors (true path rule).

Proposition (GO enrichment -- hypergeometric test). Given a set of $n$ proteins of interest drawn from a universe of $N$ proteins, with $K$ proteins in the universe annotated with GO term $t$, the p-value for over-representation of $t$ among the $k$ annotated proteins in the set is: $$p = \sum_{x=k}^{\min(K,n)} \frac{\dbinom{K}{x}\dbinom{N-K}{n-x}}{\dbinom{N}{n}}$$ This is the upper tail probability of the hypergeometric distribution. After computing p-values for all GO terms, Benjamini-Hochberg correction controls the FDR across multiple tests.

Python implementation. GO enrichment via the hypergeometric test, and Viterbi decoding against a simplified profile HMM.

from math import comb

def go_enrichment(N, K, n, k):
    """Hypergeometric p-value: k or more of n proteins annotated with term t,
    given K annotated proteins in a universe of N."""
    return sum(
        comb(K, x) * comb(N - K, n - x) / comb(N, n)
        for x in range(k, min(K, n) + 1)
    )

def bh_correction(pvalues, q=0.05):
    n = len(pvalues)
    indexed = sorted(enumerate(pvalues), key=lambda x: x[1])
    adjusted = [0.0] * n
    min_so_far = 1.0
    for rank, (i, p) in reversed(list(enumerate(indexed, 1))):
        adj = min(p * n / rank, 1.0)
        min_so_far = min(min_so_far, adj)
        adjusted[i] = min_so_far
    return adjusted

def viterbi_profile(sequence, emit_M, emit_I, trans, L):
    """Simplified Viterbi for a profile HMM with match/insert/delete states.
    emit_M[k][aa]: log-prob of aa in match state k (k=1..L).
    emit_I[k][aa]: log-prob of aa in insert state k.
    trans[k]: dict with keys MM, MI, MD, IM, II, DM, DD (log-probs).
    """
    NEG_INF = float('-inf')
    n = len(sequence)
    M = [[NEG_INF] * (L + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (L + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (L + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        aa = sequence[i - 1]
        for k in range(1, L + 1):
            e_m = emit_M[k].get(aa, NEG_INF)
            e_i = emit_I[k].get(aa, NEG_INF)
            t = trans[k - 1]
            M[i][k] = e_m + max(
                M[i-1][k-1] + t.get('MM', NEG_INF),
                I[i-1][k-1] + t.get('IM', NEG_INF),
                D[i-1][k-1] + t.get('DM', NEG_INF),
            )
            I[i][k] = e_i + max(
                M[i-1][k] + t.get('MI', NEG_INF),
                I[i-1][k] + t.get('II', NEG_INF),
            )
            D[i][k] = max(
                M[i][k-1] + t.get('MD', NEG_INF),
                D[i][k-1] + t.get('DD', NEG_INF),
            )
    return max(M[n][L], I[n][L], D[n][L])