Genomics
Overview
The central dogma. The flow of genetic information in a cell follows a single direction: DNA is transcribed into RNA, which is translated into protein. DNA is a double-stranded polymer made of four nucleotides - adenine (A), cytosine (C), guanine (G), thymine (T) - held together by complementary base pairing (A–T, C–G). The complete DNA sequence of an organism is its genome. Genomics is the study of genomes: their structure, content, and evolution.
Computational challenges. A human genome contains roughly $3 \times 10^9$ base pairs. No current technology reads it in one pass: sequencers produce millions of short reads (fragments of 100–30,000 bp depending on the platform), and computational methods must reconstruct the original sequence, identify variants, and annotate functional elements. The core algorithmic problems are sequence alignment (comparing two sequences) and genome assembly (reconstructing a genome from overlapping reads).
String algorithms at the core. A DNA sequence is a string over the alphabet $\Sigma = \{\texttt{A}, \texttt{C}, \texttt{G}, \texttt{T}\}$. Genomics imports and adapts classical string algorithms - dynamic programming, suffix arrays, hash tables - and scales them to gigabase inputs. The dominant complexity constraints are time (reads number in the millions) and memory (a genome-scale suffix array requires tens of gigabytes).
Course
DNA and the Genome
Definition (DNA sequence). A DNA sequence of length $n$ is a string $s = s_1 s_2 \cdots s_n$ over the alphabet $\Sigma = \{\texttt{A}, \texttt{C}, \texttt{G}, \texttt{T}\}$. The complement of a base is defined by $\overline{\texttt{A}} = \texttt{T}$, $\overline{\texttt{T}} = \texttt{A}$, $\overline{\texttt{C}} = \texttt{G}$, $\overline{\texttt{G}} = \texttt{C}$. The reverse complement of $s$ is the string $\overline{s_n}\, \overline{s_{n-1}} \cdots \overline{s_1}$. In the double helix, each strand is the reverse complement of the other.
Definition (Genome, chromosome, gene). The genome of an organism is the complete set of its hereditary information, encoded as DNA. In eukaryotes, the genome is organized into chromosomes: linear double-stranded DNA molecules packaged with proteins. A gene is a region of the genome that encodes a functional product (typically a protein or a structural RNA). Genes represent roughly 2% of the human genome; the rest consists of regulatory sequences, repetitive elements, and regions of unknown function.
Definition (GC content). The GC content of a sequence $s$ of length $n$ is the fraction of positions occupied by G or C: $\text{GC}(s) = \frac{|\{i : s_i \in \{\texttt{G}, \texttt{C}\}\}|}{n}$. GC content varies across species (from $\sim$25% in some bacteria to $\sim$75% in others) and across regions of the same genome. It affects sequencing quality and DNA stability, since G–C pairs form three hydrogen bonds versus two for A–T pairs.
Definition (k-mer). A $k$-mer is a substring of length $k$. A sequence of length $n$ contains exactly $n - k + 1$ $k$-mers (with overlaps). The $k$-mer spectrum of a sequence is the multiset of all its $k$-mers. Over the alphabet $\Sigma$, there are $4^k$ distinct $k$-mers. $k$-mers are a fundamental data structure in genome assembly, read mapping, and error correction.
Sequencing Technologies
The sequencing problem. A sequencer cannot read a whole chromosome in one pass. Instead, it produces a large number of short reads: substrings sampled (approximately at random) from the genome. Reconstruction of the original genome from these reads is the assembly problem. The key parameters of a sequencing technology are: read length, throughput (total bases per run), error rate, and error profile (substitutions vs. insertions/deletions).
Short-read sequencing (Illumina). The dominant technology since $\sim$2007. DNA is fragmented, adapters are ligated to both ends, and fragments are amplified on a flow cell by bridge PCR, forming clusters. Sequencing proceeds by synthesis: at each cycle, fluorescently labeled reversible-terminator nucleotides are incorporated, imaged, and cleaved. This yields reads of 100–300 bp with very low error rates ($\sim$0.1%), primarily substitution errors. A single run produces hundreds of gigabases of data. The short read length is the main limitation: it makes it hard to assemble repetitive regions longer than a read.
Long-read sequencing (PacBio, Oxford Nanopore). Two platforms produce reads of 10,000–1,000,000 bp, enabling assembly across repetitive regions. PacBio (SMRT sequencing) observes polymerase activity in zero-mode waveguides; Oxford Nanopore measures ionic current changes as DNA passes through a protein pore. Both have higher error rates ($\sim$1–15%) concentrated in homopolymer runs (consecutive identical bases), but sequencing errors are largely random and can be corrected by high coverage or hybrid assembly with short reads.
Sequence file formats: FASTA and FASTQ. Sequencing data and reference genomes are exchanged in two plain-text formats. FASTA stores sequences without quality information: each entry is a header line beginning with > followed by an arbitrary identifier, then the sequence on subsequent lines. FASTQ stores reads with per-base quality scores: four lines per read - a header beginning with @, the sequence, a separator line +, and one ASCII-encoded quality character per base. The Phred quality score at a base is $Q = -10 \log_{10} P_{\text{err}}$, where $P_{\text{err}}$ is the estimated probability that the base call is incorrect, encoded in the file as the character $\texttt{chr}(Q + 33)$. So $Q = 20$ corresponds to error probability $10^{-2}$, $Q = 30$ to $10^{-3}$, and $Q = 40$ to $10^{-4}$. Quality scores are essential for filtering low-confidence bases and for weighting evidence in variant calling.
Coverage intuition. Think of tearing a book into short, randomly placed strips. To reconstruct the text you need enough strips so that every word is covered by at least one strip. With only one strip per region you cannot distinguish a smudged word (sequencing error) from the real text; with 30 overlapping strips showing the same word you can be confident. Two complications arise in practice. First, coverage is uneven: some regions are over-represented (biased amplification, high GC content) and others are missed entirely. Second, if the same sentence is repeated in many chapters, strips from those regions are indistinguishable and you cannot tell where each one belongs. These two phenomena -- non-uniform coverage and repeats -- are why the idealized Poisson model below is only an approximation, and why 30$\times$ coverage is needed in practice rather than the $\approx 5\times$ that the model predicts.
Definition (Coverage). Let $G$ be the genome length, $L$ the mean read length, and $N$ the total number of reads. The sequencing depth (or coverage) is $C = \frac{N \cdot L}{G}$. A coverage of $C = 30\times$ means each base is, on average, covered by 30 reads. The Lander–Waterman model treats read start positions as uniform and independent, so the number of reads starting at any given base follows $\text{Binomial}(N, 1/G) \approx \text{Poisson}(N/G)$ for large $G$. A base is covered by read $i$ if that read starts within the window $[b - L + 1, b]$, so the depth at base $b$ is Poisson with mean $NL/G = C$. Under this model, the fraction of the genome covered by at least one read is $1 - e^{-C}$ (probability that a Poisson$(C)$ variable is $\geq 1$), and the probability that a given base has depth exactly $k$ is $\frac{e^{-C} C^k}{k!}$.
Proposition (Coverage for near-complete assembly). Under the Lander–Waterman model, achieving less than $1\%$ uncovered bases requires $C \geq \ln(100) \approx 4.6$. In practice, due to non-uniform coverage and repetitive regions, coverage of $30\times$ (short reads) or $50\times$ (long reads) is typically required for a high-quality assembly.
Proof.
Under the Lander–Waterman model, the probability that a given base is not covered by any read is $e^{-C}$. Setting $e^{-C} \leq 0.01$ gives $-C \leq \ln(0.01) = -\ln(100)$, i.e., $C \geq \ln(100) \approx 4.6$.Exact String Matching
The pattern matching problem. The simplest string query in genomics: given a short pattern $p$ of length $m$ and a long text $t$ of length $n$, find all positions $i$ such that $t[i..i+m-1] = p$. Practical examples include locating restriction enzyme cut sites ($m \approx 6$), checking whether a primer binds to a reference ($m \approx 20$), and finding exact seed matches as a subroutine of read alignment. Because reads may originate from either strand of the double helix, one must always search for both $p$ and its reverse complement $\overline{p}$.
Naïve matching. The naïve algorithm slides a window of length $m$ over $t$ and compares all characters at each position. It runs in $O(nm)$ time and $O(1)$ extra space. For a human genome ($n = 3 \times 10^9$, $m = 20$), this is $6 \times 10^{10}$ comparisons. Early termination on mismatch makes the average case much faster: for a uniformly random DNA string, the expected number of comparisons per window is $O(1)$ since the first character mismatches with probability $3/4$. In the worst case ($t = \texttt{AAA}\cdots$, $p = \texttt{AAA}\cdots\texttt{B}$) every window requires $m$ comparisons. This worst case motivates the KMP and Boyer-Moore algorithms, which guarantee $O(n)$ time.
def rc(seq):
return seq.translate(str.maketrans('ACGT', 'TGCA'))[::-1]
def exact_match(pattern, text):
m, n = len(pattern), len(text)
return [i for i in range(n - m + 1) if text[i:i+m] == pattern]
def match_both_strands(pattern, text):
return exact_match(pattern, text), exact_match(rc(pattern), text)
Indexed search: BWT and FM-index. Scanning the full text for every query is wasteful when the text is fixed (a reference genome) and many queries (millions of reads) will be issued against it. The idea of an index is to preprocess the text once so that subsequent queries take time depending only on the pattern length $m$, not on the text length $n$.
The Burrows–Wheeler transform (BWT) of a string $t$ (terminated by a sentinel symbol $\#$ smaller than every alphabet character) is the last column of the matrix whose rows are all cyclic rotations of $t\#$ sorted lexicographically. Remarkably, $t$ can be recovered exactly from its BWT alone, and together with two small auxiliary tables -- the $C$ array (number of characters in $t$ smaller than each symbol) and rank counts -- the BWT supports backward search: given a pattern $p$ of length $m$, the range of suffix-array positions matching $p$ is computed in $O(m)$ time, independent of $n$. This is the FM-index (Ferragina–Manzini, 2000).
Modern short-read aligners (BWA, Bowtie) build the FM-index of the reference once ($\sim$3 GB for the human genome) and align each read with a handful of BWT operations. The FM-index combines the query speed of a suffix tree with a memory footprint close to that of the raw text.
Sequence Alignment
Definition (Alignment). An alignment of two strings $a = a_1 \cdots a_m$ and $b = b_1 \cdots b_n$ over $\Sigma$ is a placement of the characters of $a$ and $b$ into two rows of a matrix such that: each column contains at least one non-gap character, and the left-to-right order of characters in each row is preserved. A column is a match if both characters are identical, a mismatch if both are non-gap but different, and a gap (or indel) if one character is a gap symbol $-$. An alignment score is assigned by a scoring function $s(\cdot, \cdot)$ that sums contributions over all columns.
Definition (Hamming distance). Let $a$ and $b$ be two strings of equal length $n$ over $\Sigma$. The Hamming distance $d_H(a, b) = |\{i : a_i \neq b_i\}|$ counts positions where the strings differ. It considers only substitutions and is undefined for strings of unequal length. In genomics, it naturally captures the number of sequencing errors between a read and the reference region it covers, under the assumption that the read has no insertions or deletions relative to that region.
def hamming(a, b):
assert len(a) == len(b)
return sum(x != y for x, y in zip(a, b))
Definition (Edit distance). The edit distance (Levenshtein distance) $d(a, b)$ between two strings $a$ and $b$ is the minimum number of single-character operations (insertions, deletions, substitutions) needed to transform $a$ into $b$. It equals the minimum number of mismatches plus gaps in any alignment of $a$ and $b$ with unit costs. The edit distance satisfies the metric axioms:
- $d(a,b) \geq 0$
- $d(a,b) = 0 \iff a = b$
- $d(a,b) = d(b,a)$
- $d(a,c) \leq d(a,b) + d(b,c)$
Why dynamic programming? To align $a$ (length $m$) against $b$ (length $n$) we need to decide, for each character, whether it is matched to a character of the other string or to a gap. The number of possible alignments is exponential: roughly $\binom{m+n}{m}$, which grows faster than $2^m$ for comparable lengths. Enumerating them all is hopeless.
The saving observation is optimal substructure: the last column of any alignment of $a[1..i]$ with $b[1..j]$ is one of exactly three things.
- $a_i$ is aligned with $b_j$ (a match or mismatch).
- $a_i$ is aligned with a gap on the $b$ side.
- $b_j$ is aligned with a gap on the $a$ side.
In every case, removing that last column leaves an alignment of a strictly shorter prefix pair: $(a[1..i-1], b[1..j-1])$, $(a[1..i-1], b[1..j])$, or $(a[1..i], b[1..j-1])$ respectively. If the original alignment is optimal, the leftover prefix alignment must also be optimal for its own prefix pair (otherwise we could swap in a better prefix alignment and improve the total score -- a contradiction). This is the exchange argument that justifies recursion.
So define $F(i, j)$ as the score of the best alignment of $a[1..i]$ with $b[1..j]$. We can compute $F(i,j)$ from three smaller subproblems, building the table from top-left to bottom-right. Computing the whole $(m+1)\times(n+1)$ table takes $O(mn)$ time and space.
Theorem (Needleman–Wunsch, 1970). Let $a = a_1 \cdots a_m$ and $b = b_1 \cdots b_n$. Define the scoring function: $s(x, y) = +1$ if $x = y$, $-1$ if $x \neq y$, and gap penalty $-d$ for each gap. Let $F(i,j)$ be the score of the best alignment of $a[1..i]$ with $b[1..j]$. The three cases identified above translate directly into the recurrence
- $F(i, 0) = -i \cdot d$ (aligning $i$ characters of $a$ against $i$ gaps)
- $F(0, j) = -j \cdot d$ (aligning $j$ characters of $b$ against $j$ gaps)
- $F(i, j) = \max\{F(i-1,\, j-1) + s(a_i, b_j),\; F(i-1,\, j) - d,\; F(i,\, j-1) - d\}$
The three terms correspond exactly to cases 1, 2, and 3 above. The optimal global alignment score is $F(m, n)$.
Proof (correctness of the recurrence).
Consider an optimal alignment of $a[1..i]$ and $b[1..j]$. Its last column is one of three cases:(1) $a_i$ aligned with $b_j$ - the score is $F(i-1, j-1) + s(a_i, b_j)$;
(2) $a_i$ aligned with a gap - the score is $F(i-1, j) - d$;
(3) $b_j$ aligned with a gap - the score is $F(i, j-1) - d$. The prefix alignment in each case is itself optimal (if it were not, we could substitute a better prefix alignment and improve the total score, contradicting optimality). Thus $F(i,j)$ is the maximum over these three cases.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
m, n = len(a), len(b)
dp = [[i * gap] + [0] * n for i in range(m + 1)]
dp[0] = [j * gap for j in range(n + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
sub = match if a[i-1] == b[j-1] else mismatch
dp[i][j] = max(
dp[i-1][j-1] + sub, dp[i-1][j] + gap, dp[i][j-1] + gap
)
ra, rb, i, j = [], [], m, n
while i or j:
sub = match if i and j and a[i-1] == b[j-1] else mismatch
if i and j and dp[i][j] == dp[i-1][j-1] + sub:
ra.append(a[i-1])
rb.append(b[j-1])
i -= 1
j -= 1
elif i and dp[i][j] == dp[i-1][j] + gap:
ra.append(a[i-1])
rb.append('-')
i -= 1
else:
ra.append('-')
rb.append(b[j-1])
j -= 1
return ''.join(reversed(ra)), ''.join(reversed(rb)), dp[m][n]
From global to local alignment. Needleman–Wunsch aligns all of $a$ against all of $b$. End gaps are penalized just like interior gaps. This is the right model when two sequences are known to be related along their full length (e.g. two homologous genes of similar size).
For many biological questions the situation is different: $a$ might be a short query (a read, a protein domain) and $b$ a long database sequence, or both sequences may share only a conserved region flanked by unrelated sequence. In these cases, forcing a full alignment penalizes the unrelated flanking regions heavily and yields a meaningless score. What we want is to find the pair of substrings $(a[i..i'],\, b[j..j'])$ whose alignment score is highest -- ignoring everything outside them.
The trick to turning NW into a local aligner is simple: we allow the alignment score to "restart" at any position by adding $0$ as a fourth option in the recurrence. Concretely, define $H(i,j)$ as the score of the best alignment ending at $a_i$ and $b_j$, or $0$ if no alignment ending here has a positive score. Then extending from a cell with $H = 0$ is the same as starting a fresh alignment at that point. The best local alignment score is $\max_{i,j} H(i,j)$, found anywhere in the table, not just at the bottom-right corner. Traceback starts at that maximum and stops as soon as it reaches a cell with $H = 0$ -- that cell marks the start of the local alignment.
Definition (Local alignment, Smith–Waterman). A local alignment of $a$ and $b$ is a global alignment of a substring $a[i..i']$ with a substring $b[j..j']$, chosen to maximize the alignment score. The Smith–Waterman algorithm computes the optimal local alignment score using the recurrence $$H(i, j) = \max\begin{cases} 0 & \text{(start fresh)} \\ H(i-1,\, j-1) + s(a_i, b_j) & \text{(case 1: match/mismatch)} \\ H(i-1,\, j) - d & \text{(case 2: gap in } b\text{)} \\ H(i,\, j-1) - d & \text{(case 3: gap in } a\text{)} \end{cases}$$ The first three cases are identical to Needleman–Wunsch. The only addition is the $0$ option: if extending any alignment to position $(i,j)$ would give a negative score, it is better to declare that no useful alignment ends here and start over. The optimal local alignment score is $\max_{i,j} H(i,j)$, and the traceback starts from this maximum and stops when a cell with value $0$ is reached.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
m, n = len(a), len(b)
dp = [[0] * (n + 1) for _ in range(m + 1)]
best, pos = 0, (0, 0)
for i in range(1, m + 1):
for j in range(1, n + 1):
sub = match if a[i-1] == b[j-1] else mismatch
dp[i][j] = max(
0, dp[i-1][j-1] + sub, dp[i-1][j] + gap, dp[i][j-1] + gap
)
if dp[i][j] > best:
best, pos = dp[i][j], (i, j)
ra, rb = [], []
i, j = pos
while dp[i][j]:
sub = match if a[i-1] == b[j-1] else mismatch
if dp[i][j] == dp[i-1][j-1] + sub:
ra.append(a[i-1])
rb.append(b[j-1])
i -= 1
j -= 1
elif dp[i][j] == dp[i-1][j] + gap:
ra.append(a[i-1])
rb.append('-')
i -= 1
else:
ra.append('-')
rb.append(b[j-1])
j -= 1
return ''.join(reversed(ra)), ''.join(reversed(rb)), best
Affine gap penalties. The linear gap penalty $-d$ per gap column (used above) treats a gap of length $\ell$ as $\ell$ independent gaps. Biologically this is wrong: a single insertion of 10 bases is one mutational event, far more likely than 10 separate single-base insertions. An affine gap penalty charges a fixed cost $-o$ to open a gap plus a smaller cost $-e$ to extend it by one additional base, so a gap of length $\ell$ costs $o + (\ell - 1) e$ with $o \gg e$ (typical values: $o = 10$, $e = 1$).
Implementing affine penalties requires tracking whether the previous column was already inside a gap, which the single-table NW/SW recurrence cannot do. The fix is three tables: $M(i,j)$ for alignments ending in a match/mismatch, $X(i,j)$ for alignments ending in a gap in $b$ (consuming a character of $a$), and $Y(i,j)$ for alignments ending in a gap in $a$. Each table's recurrence distinguishes opening a gap (transition from $M$, costing $-o$) from extending a gap (staying in $X$ or $Y$, costing $-e$). The total complexity remains $O(mn)$ in both time and space. This is the Gotoh algorithm (1982); it is the default in modern aligners (BWA, minimap2) and BLAST.
From exact to heuristic alignment: BLAST. Needleman–Wunsch and Smith–Waterman are $O(mn)$. When $m = 300$ (a read) and $n = 3 \times 10^9$ (the human genome), this is $9 \times 10^{11}$ operations per read - far too slow. BLAST (Basic Local Alignment Search Tool) accelerates local alignment by a seed-and-extend heuristic:
(1) find all exact matches of length $W$ (seeds) between the query and the database,
(2) extend each seed in both directions as long as the local alignment score stays above a threshold. This misses alignments with no exact $W$-mer match, but in practice very few biologically meaningful alignments are missed when $W = 11$. Modern short-read aligners (BWA, Bowtie2) use different data structures (FM-index, Burrows–Wheeler transform) to achieve $O(n + m)$ alignment.
Genome Assembly
The assembly problem. Given a set of reads $R = \{r_1, \ldots, r_N\}$ sampled from an unknown genome $G$, reconstruct $G$. This is fundamentally ambiguous: repetitive regions of length greater than a read cannot be resolved from the reads alone. In practice, the output is not a single sequence but a set of contigs (contiguous assembled sequences) and scaffolds (contigs linked by mate-pair information, with gaps of estimated size). Assembly quality is measured by the N50: the length $L$ such that contigs of length $\geq L$ cover at least half the total assembled sequence.
Definition (Overlap graph). Given a set of reads $R$ and a minimum overlap length $\ell$, the overlap graph $G_R = (R, E)$ is a directed graph where there is an edge from $r_i$ to $r_j$ if the suffix of $r_i$ of length $\geq \ell$ matches (approximately) the prefix of $r_j$. The edge weight is the length of the overlap. To reconstruct the genome: if we can find an ordering of all reads $r_{\pi(1)}, r_{\pi(2)}, \ldots, r_{\pi(N)}$ such that each consecutive pair overlaps, then concatenating them (removing the overlapping portion at each junction) spells out the genome. Such an ordering is exactly a Hamiltonian path in $G_R$ -- a path visiting every read (vertex) exactly once. Finding a Hamiltonian path is NP-hard in general, which motivates alternative formulations.
def overlap(r1, r2, min_len):
return next((k for k in range(min(len(r1), len(r2)), min_len - 1, -1)
if r1[-k:] == r2[:k]), 0)
def overlap_graph(reads, min_len):
return {
i: [(j, k) for j, r2 in enumerate(reads)
if i != j and (k := overlap(r1, r2, min_len)) > 0]
for i, r1 in enumerate(reads)
}
The de Bruijn idea: switching from reads to k-mers. The overlap graph approach has two problems. First, computing all pairwise overlaps between $N$ reads costs $O(N^2)$, which is prohibitive for millions of reads. Second, finding a Hamiltonian path is NP-hard.
The de Bruijn graph sidesteps both by changing what the vertices and edges represent. Instead of tracking reads, we track k-mers. The key observation is: two consecutive k-mers in a sequence necessarily share a $(k-1)$-mer overlap -- the suffix of the first equals the prefix of the second. For example, with $k = 3$: the k-mers ACG and CGT share the overlap CG.
We exploit this as follows: make each distinct $(k-1)$-mer a node, and represent each k-mer as a directed edge from its length-$(k-1)$ prefix to its length-$(k-1)$ suffix. The genome then corresponds to a path that traverses every edge (k-mer) exactly once -- an Eulerian path. Finding an Eulerian path takes $O(|E|)$ time (Hierholzer's algorithm), a dramatic improvement over the NP-hard Hamiltonian path problem.
Recovering the sequence from the path is immediate: consecutive nodes in the path share a $(k-2)$-mer overlap, so the assembled sequence is the first node concatenated with the last character of each subsequent node (see the assemble function below).
Definition (de Bruijn graph). Let $k \geq 1$. The de Bruijn graph $B_k(R)$ of a read set $R$ is a directed multigraph defined as follows. Each node is a distinct $(k-1)$-mer appearing in the reads. For each $k$-mer $w = w_1 \cdots w_k$ in the reads, add a directed edge from node $w_1 \cdots w_{k-1}$ to node $w_2 \cdots w_k$, labeled $w$. The assembly problem in this graph reduces to finding an Eulerian path -- a path traversing every edge exactly once -- which can be solved in $O(|E|)$ time.
Proposition (Eulerian path condition in de Bruijn graphs). A directed graph has an Eulerian path if and only if the graph is connected (on vertices with nonzero degree) and at most one vertex has $\text{out-degree} - \text{in-degree} = 1$ (the start), at most one has $\text{in-degree} - \text{out-degree} = 1$ (the end), and all other vertices have equal in- and out-degree.
For a de Bruijn graph built from a single error-free linear genome $g_1 g_2 \cdots g_n$ with sufficient coverage (every k-mer appears in at least one read), these conditions hold for the following reason. Every interior position $i$ ($k \leq i \leq n-k+1$) produces a k-mer $g_{i} \cdots g_{i+k-1}$ whose prefix node $g_i \cdots g_{i+k-2}$ receives one incoming edge and whose suffix node $g_{i+1} \cdots g_{i+k-1}$ has one outgoing edge. Since each interior $(k-1)$-mer is both the suffix of the k-mer ending there and the prefix of the k-mer starting there, it has in-degree $=$ out-degree $= 1$. The only exceptions are the very first $(k-1)$-mer $g_1 \cdots g_{k-1}$ (no k-mer ends before position 1, so it has in-degree 0, out-degree 1) and the very last $(k-1)$-mer $g_{n-k+2} \cdots g_n$ (no k-mer starts after position $n-k+1$, so it has in-degree 1, out-degree 0). Connectivity follows from the linear ordering of the genome. The unique Eulerian path from the first to the last $(k-1)$-mer then spells out exactly $g$.
Repeats break uniqueness: a repeated $(k-1)$-mer appears as a single node with in-degree $= $ out-degree $> 1$, so the degree condition still holds but multiple Eulerian paths exist, some of which spell out incorrect assemblies. Errors introduce false k-mer edges whose extra in/out degrees disturb the balance, typically producing a non-Eulerian graph that requires additional resolution steps.
from collections import Counter, defaultdict
def debruijn_graph(reads, k):
graph = defaultdict(list)
for read in reads:
for i in range(len(read) - k + 1):
graph[read[i:i+k-1]].append(read[i+1:i+k])
return graph
def eulerian_path(graph):
in_deg = Counter(v for vs in graph.values() for v in vs)
out_deg = {u: len(vs) for u, vs in graph.items()}
start = next(
(u for u in graph if out_deg[u] - in_deg[u] == 1), next(iter(graph))
)
adj = {u: list(vs) for u, vs in graph.items()}
stack, path = [start], []
while stack:
if adj.get(stack[-1]):
stack.append(adj[stack[-1]].pop())
else:
path.append(stack.pop())
return path[::-1]
def assemble(reads, k):
path = eulerian_path(debruijn_graph(reads, k))
return path[0] + ''.join(node[-1] for node in path[1:])
Genome Annotation
What is annotation. A raw genome sequence is just a string. Genome annotation is the process of identifying the location and function of its elements: protein-coding genes, non-coding RNAs, regulatory sequences, and repetitive elements. It proceeds in two phases: structural annotation (where are the genes?) and functional annotation (what do they do?). The two main strategies for gene finding are ab initio prediction (from sequence signals alone) and homology-based prediction (by similarity to known genes).
The genetic code and reading frames. The information in a gene is read in consecutive non-overlapping triplets called codons. Each codon specifies one amino acid in the resulting protein, or signals where to start and stop translation. There are $4^3 = 64$ possible codons but only 20 amino acids; most amino acids are encoded by multiple codons (the code is redundant). Two special roles:
- Start codon:
ATG(encodes methionine). Translation begins here. - Stop codons:
TAA,TAG,TGA. They encode no amino acid; they terminate translation.
Since codons are triplets, there are three possible ways to partition a sequence into consecutive triplets depending on where you start: positions $\{0, 3, 6, \ldots\}$, $\{1, 4, 7, \ldots\}$, or $\{2, 5, 8, \ldots\}$. Each choice is called a reading frame. Since a gene can lie on either strand of the double helix, there are $3 \times 2 = 6$ reading frames to consider for any given sequence.
Definition (Open reading frame). An open reading frame (ORF) is a contiguous stretch of codons beginning with a start codon (ATG) and ending with a stop codon (TAA, TAG, or TGA), with no in-frame stop codon in between. In a random DNA string, a stop codon appears roughly every $\frac{4^3}{3} \approx 21$ codons by chance; so a long uninterrupted stretch of codons is unlikely to arise by accident and is strong evidence of a real gene. In prokaryotes, long ORFs ($\geq 100$ codons) are indeed strong evidence of a protein-coding gene. In eukaryotes, genes are interrupted by introns -- non-coding segments that are spliced out before translation -- so ORF finding alone is insufficient.
STOP_CODONS = {'TAA', 'TAG', 'TGA'}
def find_orfs(seq, min_codons=100):
orfs = []
for frame in range(3):
codons = [seq[i:i+3] for i in range(frame, len(seq) - 2, 3)]
starts = [i for i, c in enumerate(codons) if c == 'ATG']
for start in starts:
stop = next(
(i for i in range(start + 1, len(codons))
if codons[i] in STOP_CODONS),
None,
)
if stop and stop - start >= min_codons:
orfs.append((frame + start * 3, frame + (stop + 1) * 3, frame))
return orfs
def reverse_complement(seq):
return seq.translate(str.maketrans('ACGT', 'TGCA'))[::-1]
def find_all_orfs(seq, min_codons=100):
rc = reverse_complement(seq)
reverse = [(len(seq) - e, len(seq) - s, frame + 3)
for s, e, frame in find_orfs(rc, min_codons)]
return find_orfs(seq, min_codons) + reverse
Ab initio gene prediction. In eukaryotes, gene structure is complex: exons (coding), introns (non-coding, spliced out), UTRs, and splice signals. Ab initio predictors (Augustus, GeneMark) model these signals using hidden Markov models (HMMs). States represent genomic features (exon, intron, intergenic, splice donor, splice acceptor), and emission probabilities capture codon usage bias and splice consensus sequences. The Viterbi algorithm finds the most likely state sequence - and hence the most likely gene structure - in $O(n \cdot |S|^2)$ time, where $|S|$ is the number of states.
Functional annotation. Once gene locations are predicted, each gene product is assigned a function by similarity search against curated databases: UniProt/Swiss-Prot (manually curated protein sequences), Pfam (protein domain families modeled as profile HMMs), and the Gene Ontology (GO) - a controlled vocabulary of biological process, molecular function, and cellular component terms. A BLAST or HMMer search transfers annotations from known homologs to the new gene. The reliability of transferred annotations decreases as sequence identity drops below $\sim$30%.
Variant Calling
Definition (Variant types). A variant is a position where a sequenced sample differs from a reference genome. Single nucleotide polymorphisms (SNPs) are single-base substitutions. Insertions and deletions (indels) are short additions or removals of bases (typically $< 50$ bp). Structural variants (SVs) are large rearrangements: deletions, duplications, inversions, translocations ($\geq 50$ bp). SNPs and indels are the most abundant: a typical human genome differs from the reference at $\sim 4 \times 10^6$ SNP positions.
The variant calling pipeline. The standard pipeline has three stages.
(1) Read mapping: align reads to the reference genome using a short-read aligner (BWA-MEM, Bowtie2). Each read is placed at its best matching position; the result is a BAM file - a sorted, indexed binary alignment.
(2) Pileup: for each reference position, collect all reads covering that position and count the observed bases.
(3) Genotyping: apply a statistical model to decide whether an observed deviation from the reference is a true variant or a sequencing error. The industry-standard tool is GATK HaplotypeCaller, which uses a local assembly step to handle indels cleanly.
Diploid genomes and alleles. Humans (and most animals) are diploid: every cell carries two copies of each chromosome, one inherited from each parent. At any given genomic position, the two copies may carry the same base (the individual is homozygous) or different bases (the individual is heterozygous). Call the base in the reference genome $R$ and any differing base $A$ (alternative). A genotype at a position is one of $RR$, $RA$, or $AA$. Each sequenced read comes from one of the two chromosomal copies chosen at random with equal probability.
Definition (Genotype likelihood). Let a position be covered by $n$ reads, of which $k$ show the alternative allele and $n - k$ show the reference allele. Let $\epsilon$ be the per-base sequencing error rate. The likelihood of the observed data under each genotype is computed as follows:
- $P(\text{data} \mid RR) = (1-\epsilon)^{n-k} \cdot \epsilon^k$ -- both copies carry $R$, so each read should show $R$ with probability $1-\epsilon$ and $A$ only by error.
- $P(\text{data} \mid RA) = \left(\tfrac{1}{2}\right)^n$ -- one copy carries $R$ and the other carries $A$; each read comes from one of the two copies with probability $\tfrac{1}{2}$, so each read shows $R$ or $A$ each with probability $\tfrac{1}{2}$, regardless of how many show $R$ vs $A$. The count $k$ carries no information about this genotype.
- $P(\text{data} \mid AA) = \epsilon^{n-k} \cdot (1-\epsilon)^k$ -- both copies carry $A$, so each read should show $A$ with probability $1-\epsilon$ and $R$ only by error.
The called genotype is the one maximizing the posterior $P(\text{genotype} \mid \text{data}) \propto P(\text{data} \mid \text{genotype}) \cdot P(\text{genotype})$. The prior $P(\text{genotype})$ encodes how common variants are: variants at known SNP sites (from databases like dbSNP) get a higher prior for $RA$ or $AA$; at novel positions the prior strongly favors $RR$ since most positions are not variant. Variants are reported in VCF (Variant Call Format), a tab-separated text file with one line per variant.
from math import log
def genotype_likelihoods(n, k, error=1e-3):
return {
'RR': (n-k) * log(1 - error) + k * log(error),
'RA': n * log(0.5),
'AA': k * log(1 - error) + (n-k) * log(error),
}
def call_genotype(n, k, error=1e-3):
lls = genotype_likelihoods(n, k, error)
return max(lls, key=lls.get)
Comparative Genomics
Goals of comparative genomics. Comparing genomes across species reveals which regions are evolutionarily conserved (and therefore likely functional), infers the evolutionary history of genes and species, and identifies gene gains, losses, and rearrangements. The two central problems are multiple sequence alignment (aligning more than two sequences simultaneously) and phylogenetic tree inference (reconstructing the evolutionary tree from sequence data).
Definition (Multiple sequence alignment). A multiple sequence alignment (MSA) of $k$ sequences $s^{(1)}, \ldots, s^{(k)}$ is a matrix with $k$ rows and $c$ columns, where each row is a gapped version of one sequence and each column contains at least one non-gap character. The sum-of-pairs score is $\sum_{1 \leq i < j \leq k} \text{score}(s^{(i)}, s^{(j)})$ summed over all pairwise alignments induced by the MSA columns. Optimal MSA under sum-of-pairs is NP-hard for $k \geq 3$; in practice, progressive methods (MUSCLE, CLUSTALW) are used: they build a guide tree from pairwise distances and merge alignments bottom-up.
Why raw sequence divergence underestimates evolutionary distance. Suppose we compare two sequences and find that 30% of sites differ. The naive estimate of evolutionary distance is "0.3 substitutions per site." But this is too low, for two reasons.
First, back mutations: a site might mutate from A to T and then back to A. Those two mutations are invisible to us because the site looks identical, yet two events occurred. Second, multiple hits: a site mutates from A to T, then from T to G. We observe a difference (A vs G) and count it as one event, but two actually occurred. Both effects cause the observed fraction of differing sites to saturate as sequences diverge further -- the curve flattens toward the random maximum of 75% (since at equilibrium with equal frequencies, 4 equally probable bases, 3 out of 4 random pairs will differ). The true number of mutations is unbounded.
A correction model treats each site as a Markov chain over the four bases and derives the expected fraction of identical sites as a function of the true number of mutations per site $d$. Inverting that function gives a corrected distance estimate from the observed divergence.
Definition (Jukes–Cantor model). The Jukes–Cantor model is the simplest such correction model. Each site evolves independently as a continuous-time Markov chain on $\{A, C, G, T\}$ with all off-diagonal substitution rates equal. Let $d$ denote the expected number of substitutions per site. Solving the rate equation gives the probability that a site is identical between the two sequences as $p = \frac{1}{4} + \frac{3}{4} e^{-4d/3}$. (At $d = 0$ it is 1; as $d \to \infty$ it decays to $\frac{1}{4}$, the random baseline for 4 equally likely bases.) Inverting, the Jukes–Cantor distance estimated from the observed fraction $p_{\text{obs}}$ of differing sites is: $$d = -\frac{3}{4} \ln\!\left(1 - \frac{4}{3} p_{\text{obs}}\right)$$
Definition (Phylogenetic tree, UPGMA). A phylogenetic tree is a tree whose leaves are the sequences (taxa) and whose internal nodes represent ancestral sequences. Edge lengths represent evolutionary distances. UPGMA (Unweighted Pair Group Method with Arithmetic mean) is the simplest tree-building algorithm: at each step, merge the two clusters with the smallest average pairwise distance, and set the branch length to half that distance. It produces an ultrametric tree (all leaves at equal distance from the root), which is correct only under a molecular clock assumption. Neighbor-joining relaxes this assumption and is more widely used in practice.
Definition (Neighbor-joining). Neighbor-joining (NJ, Saitou–Nei 1987) does not assume a molecular clock: it produces an unrooted tree whose branch lengths reflect the actual distances rather than equal distance from a putative root. Given the distance matrix $d$ on $n$ current taxa, at each step NJ picks the pair $(i, j)$ minimizing the $Q$-criterion $$Q(i, j) = (n - 2)\, d(i, j) - \sum_{k} d(i, k) - \sum_{k} d(j, k),$$ which favors pairs that are simultaneously close to each other and far from everything else (this is what distinguishes true neighbors from pairs that merely have small pairwise distance because both are close to some third taxon). It joins $i$ and $j$ at a new internal node $u$, sets the branch lengths from $u$ to $i$ and $j$ from the row sums of $d$, then updates the matrix with $d(u, k) = \tfrac{1}{2}(d(i, k) + d(j, k) - d(i, j))$ and recurses on $n - 1$ taxa. NJ runs in $O(n^3)$ time and is the most widely used distance-based tree-building method. When the input distances exactly fit a tree (the additive case), NJ provably recovers it.
from math import log, inf
def jukes_cantor(s1, s2):
p = sum(a != b for a, b in zip(s1, s2)) / len(s1)
return -0.75 * log(1 - 4 * p / 3) if p < 0.75 else inf
def upgma(seqs):
n = len(seqs)
clusters = {i: [i] for i in range(n)}
dist = {
(i, j): jukes_cantor(seqs[i], seqs[j])
for i in range(n)
for j in range(i+1, n)
}
merges, next_id = [], n
while len(clusters) > 1:
i, j = min(dist, key=dist.get)
merges.append((i, j, dist[i, j] / 2, next_id))
new = clusters[i] + clusters[j]
for k in set(clusters) - {i, j}:
dist[min(next_id, k), max(next_id, k)] = (
len(clusters[i]) * dist[min(i, k), max(i, k)] +
len(clusters[j]) * dist[min(j, k), max(j, k)]
) / len(new)
dist = {p: d for p, d in dist.items() if i not in p and j not in p}
del clusters[i], clusters[j]
clusters[next_id] = new
next_id += 1
return merges