Computational Genomics
Finding the Replication Origin
Before a cell divides it copies its genome, and copying always starts at a fixed location called the origin of replication (ori). Locating ori in a raw genome string turns out to be a clean computational problem that can be solved by counting just two of the four nucleotides.
As replication proceeds, one template strand spends a long time single-stranded, and single-stranded cytosine (C) slowly mutates into thymine (T). That strand gradually loses its C's, so reading along it G ends up outnumbering C. The imbalance points in one direction on the way into ori and the other on the way out, which is what lets us find it.
Define the skew at each position as the running count of G minus the running count of C, scanning left to right: add $1$ for each G and subtract $1$ for each C. The skew falls along the C-rich stretch leading into ori and rises along the G-rich stretch coming out of it, so the minimum of the skew marks ori.
Press play to walk along a synthetic genome and watch the curve take shape. Green bases (G) push it up, red bases (C) pull it down, and the orange marker tracks the lowest point seen so far, which is the predicted ori. Use New genome for a fresh sequence.
On a real genome the curve is noisier but shows the same overall valley, and its minimum lands very close to the true origin. This single $O(n)$ scan is enough to narrow ori down to a small region, with no k-mer counting required.
Greedy Motif Search
A transcription factor controls many genes by binding a short pattern, a motif, that recurs with mutations in the regulatory region upstream of each gene. Given one such region per gene as a collection of DNA strings, the goal is to recover the shared motif of length $k$.
A candidate solution is one $k$-mer chosen from each string, aligned into a matrix with one row per string. The score counts, column by column, the letters that differ from the most common one, so a well-conserved motif scores low. We want the candidate with the smallest score, but checking every combination is exponential in the number of strings.
Greedy motif search gives up that guarantee for speed. From a single starting $k$-mer it builds one candidate by scanning the strings in order:
- Take a $k$-mer of the first string as the seed; it becomes the first row of the candidate.
- Count each base in each column of the rows chosen so far, then turn the counts into a profile of probabilities: an entry is $(\text{count}+1)/(m+4)$ once $m$ rows are chosen, the added pseudocount keeping any letter from reaching probability zero.
- Move to the next string and score every $k$-mer by multiplying its profile entries, one per column; add the $k$-mer with the largest product.
- Repeat steps 2 and 3 down the remaining strings to complete the candidate, then score it.
The whole process runs once for every $k$-mer of the first string as the seed, and the lowest-scoring candidate found over all seeds is the answer.