Bioinformatics Algorithms

Chapter 1: Where in the Genome Does Replication Begin?

This chapter frames a fundamental biological problem as a string search problem: given the text of a genome, where does DNA replication start? The algorithms developed here (k-mer counting, pattern matching, approximate search) are foundational to the rest of the course.

DNA: The Basics

DNA is the molecule that encodes the instructions for building and running a cell. It is made of four nucleotides: Adenine (A), Thymine (T), Cytosine (C), and Guanine (G). From a CS perspective, a DNA strand is a string over the alphabet {A, T, C, G}.

DNA is double-stranded: two strands are bonded together. The pairing is fixed by chemistry: A always pairs with T, and C always pairs with G.

The two strands are antiparallel: they run in opposite directions. Each strand has a chemical directionality defined by its ends, called 5' and 3'. By convention, DNA sequences are always written 5' to 3'. So if one strand reads 5'-ATCG-3', the paired strand reads 3'-TAGC-5', which in standard 5'-to-3' notation is CGAT.

To get the complementary strand in 5'-to-3' notation: reverse the original strand, then swap each base with its complement (A↔T, C↔G). This is called the reverse complement.

COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def reverse_complement(pattern):
    return ''.join(COMPLEMENT[b] for b in reversed(pattern))

For example: reverse_complement("ATCGTA") → "TACGAT".

DNA Replication and the Origin

Before a cell divides, it must copy its entire genome. This process is called DNA replication. Each strand of the double helix serves as a template for synthesizing a new complementary strand, producing two identical copies of the original. Watson and Crick noted this in their 1953 landmark paper on DNA structure: "it has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material."

Replication does not start at a random position. It begins at a specific region called the origin of replication (ori). A protein called DnaA initiates replication by binding to short sequences near ori called DnaA boxes. A DnaA box is essentially a short message written in the DNA that tells DnaA "bind here". In bacteria, DnaA boxes are roughly 9-mers that appear multiple times clustered within a short stretch of DNA near ori.

Understanding ori has practical applications. In gene therapy, a patient is deliberately infected with a viral vector carrying an engineered gene that encodes a therapeutic protein. Such vectors carry their own ori to hijack the host cell's replication machinery, so knowing where and how replication begins is essential for engineering them.

Finding the Origin: Frequent k-mers

The computational question is: given a genome string, find short substrings (k-mers) that appear unusually often in a region. Those are candidate DnaA boxes.

A first primitive: find all positions where a pattern occurs exactly in a text.

def pattern_match(text, pattern):
    k = len(pattern)
    return [i for i in range(len(text) - k + 1) if text[i:i+k] == pattern]

This is $O(n \cdot k)$ where $n = |text|$. To find the most frequent k-mers, slide a window across the text and count each k-mer in a dictionary:

def frequent_words_naive(text, k):
    counts = {}
    for i in range(len(text) - k + 1):
        kmer = text[i:i+k]
        counts[kmer] = counts.get(kmer, 0) + 1
    max_count = max(counts.values())
    return [kmer for kmer, c in counts.items() if c == max_count]

This is also $O(n \cdot k)$: there are $n - k + 1$ windows, and slicing each k-mer costs $O(k)$.

An alternative is a frequency array of size $4^k$ instead of a dictionary. Each k-mer is mapped to a unique integer index in $[0,\, 4^k - 1]$ by treating the string as a base-4 number (A=0, C=1, G=2, T=3). This gives $O(1)$ index access rather than a hash lookup, and makes the index encoding explicit.

BASE  = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
BASES = 'ACGT'

def kmer_to_index(kmer):
    index = 0
    for base in kmer:
        index = 4 * index + BASE[base]
    return index

def index_to_kmer(index, k):
    kmer = []
    for _ in range(k):
        kmer.append(BASES[index % 4])
        index //= 4
    return ''.join(reversed(kmer))

def frequent_words(text, k):
    freq = [0] * (4 ** k)
    for i in range(len(text) - k + 1):
        freq[kmer_to_index(text[i:i+k])] += 1
    max_count = max(freq)
    return [index_to_kmer(i, k) for i, c in enumerate(freq) if c == max_count]

Important: we must search both strands. A DnaA box on one strand has its reverse complement on the other, and DnaA can bind either. In practice, we count both a k-mer and its reverse complement together.

The Clump Finding Problem

In a real bacterial genome, ori is a region of roughly 500 base pairs in which DnaA boxes cluster tightly. The clump finding problem formalizes this: find all k-mers that appear at least $t$ times within some window of length $L$.

A k-mer forms an $(L, t)$-clump in genome if there exists a window of length $L$ containing at least $t$ occurrences of that k-mer.

def find_clumps_naive(genome, kmer_length, window_length, min_occurrences):
    clumps = set()
    for window_start in range(len(genome) - window_length + 1):
        window = genome[window_start:window_start + window_length]
        counts = {}
        for offset in range(window_length - kmer_length + 1):
            kmer = window[offset:offset + kmer_length]
            counts[kmer] = counts.get(kmer, 0) + 1
            if counts[kmer] >= min_occurrences:
                clumps.add(kmer)
    return clumps

This naive implementation is $O(n \cdot L \cdot k)$: it rebuilds the count table from scratch for every window. The fix is to keep a single count table and slide the window one position at a time, updating only at the edges: drop the k-mer leaving on the left, add the k-mer entering on the right. Each slide does $O(k)$ work, giving $O(n \cdot k)$ overall.

def find_clumps(genome, kmer_length, window_length, min_occurrences):
    clumps = set()
    counts = {}
    for kmer_start in range(len(genome) - kmer_length + 1):
        # The k-mer entering on the right of the current window.
        entering = genome[kmer_start:kmer_start + kmer_length]
        counts[entering] = counts.get(entering, 0) + 1

        # Once the window is full, drop the k-mer that fell off the left.
        leaving_start = kmer_start - (window_length - kmer_length)
        if leaving_start > 0:
            leaving = genome[leaving_start - 1:leaving_start - 1 + kmer_length]
            counts[leaving] -= 1

        if counts[entering] >= min_occurrences:
            clumps.add(entering)
    return clumps

On real genomes, clump finding returns thousands of candidates and is not specific enough on its own. The next section provides a complementary approach that narrows down ori directly.

Replication Asymmetry and the Skew

How Replication Works

In bacteria, the chromosome is circular. Replication starts at ori and proceeds in both directions simultaneously, creating two replication forks that travel around the chromosome until they meet at the terminus (ter), roughly opposite ori.

DNA is copied by an enzyme called DNA polymerase, which does not wait for the double helix to fully unwind: it starts synthesizing new DNA at the fork while the strands are still separating. DNA polymerase can only extend a strand in the 5'-to-3' direction. At each fork, one strand (the leading strand) is synthesized continuously in the direction the fork is moving. The other (the lagging strand) must be synthesized in short fragments called Okazaki fragments, going in the direction opposite to fork movement; these fragments are later stitched together by an enzyme called DNA ligase.

The critical asymmetry: the lagging strand template spends much more time as single-stranded DNA while waiting for the fork to advance. This matters because single-stranded DNA mutates in a specific, directional way. When DNA is single-stranded, cytosine (C) spontaneously decays into thymine (T) at a much higher rate (a reaction called deamination, roughly 100 times faster than in double-stranded DNA). So a strand that sits single-stranded for a long time slowly loses its C's.

This is the link between replication and counting letters. Note that no C ever becomes a G: the C's are simply destroyed (turned into T's), so a strand that sat single-stranded ends up with a deficit of C. Reading along that strand, G now outnumbers C, not because any G was created but because C's went missing. Each half of the genome was the long-single-stranded one for a different stretch, so G leads C in the direction the fork traveled (ori → ter) and the reverse holds on the other half. The G/C imbalance is a physical fingerprint of which way replication ran, so we can recover ori just by tracking it.

Wouldn't so many mutations damage the genome? They don't: each deamination is rare per replication, most are caught by DNA repair, and natural selection removes the harmful survivors. What builds up over evolutionary time is only a faint C-deficit at positions where mutations are tolerated, enough to bend the skew but far too little to harm the cell.

The Skew Diagram

Define the skew at position $i$ as the running value of $\#G - \#C$ up to position $i$: walk along the genome adding $+1$ for each G and $-1$ for each C. On the ori → ter half G dominates, so the running total rises; on the ter → ori half C dominates, so it falls. The curve therefore dips to a bottom and climbs back up: the minimum of the skew marks ori, and the maximum marks ter.

def skew(genome):
    skew_values, running_skew = [0], 0
    for base in genome:
        if base == 'G':
            running_skew += 1
        elif base == 'C':
            running_skew -= 1
        skew_values.append(running_skew)
    return skew_values

def min_skew(genome):
    skew_values = skew(genome)
    minimum = min(skew_values)
    return [
        position
        for position, value in enumerate(skew_values)
        if value == minimum
    ]

The skew approach is $O(n)$ and provides a fast, independent way to locate the ori region without any k-mer counting.

Approximate Pattern Matching

Real DnaA boxes are not identical copies of a single 9-mer. They have accumulated mutations over evolutionary time, so exact matching misses many occurrences. We need to find patterns that are "close" to a query string.

Hamming Distance

The Hamming distance $d_H(s_1, s_2)$ between two strings of equal length is the number of positions where they differ.

def hamming_distance(s1, s2):
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

Approximate Pattern Matching

Find all positions where pattern occurs in text with at most $d$ mismatches.

def approximate_pattern_match(text, pattern, d):
    k = len(pattern)
    return [
        i
        for i in range(len(text) - k + 1)
        if hamming_distance(text[i : i + k], pattern) <= d
    ]

Frequent Words with Mismatches

The final formulation: find k-mers that appear most frequently in the text, counting all approximate occurrences (up to $d$ mismatches). The idea is to generate the neighborhood of each k-mer in the text, i.e. all k-mers within Hamming distance $d$, and accumulate counts across all neighborhood members.

def neighbors(pattern, max_mismatches):
    neighborhood = {pattern}
    for _ in range(max_mismatches):
        for kmer in list(neighborhood):
            for position in range(len(kmer)):
                for base in "ACGT":
                    neighborhood.add(kmer[:position] + base + kmer[position + 1 :])
    return neighborhood

def frequent_words_with_mismatches(text, k, d):
    counts = {}
    for i in range(len(text) - k + 1):
        for neighbor in neighbors(text[i:i+k], d):
            counts[neighbor] = counts.get(neighbor, 0) + 1
    max_count = max(counts.values())
    return [kmer for kmer, c in counts.items() if c == max_count]

The number of neighbors of a k-mer within Hamming distance $d$ is $\sum_{j=0}^{d} \binom{k}{j} 3^j$. For $k = 9,\, d = 1$: each 9-mer has $1 + 9 \times 3 = 28$ neighbors. Total runtime: $O\!\left(n \cdot \sum_{j=0}^{d}\binom{k}{j} 3^j\right)$.

In practice, we apply this while also counting reverse complements, so that DnaA box signals from both strands are captured simultaneously.

This is a teaching implementation: enumerating every neighbor does not scale, and real tools replace it with indexed methods (suffix arrays, the FM-index) or probabilistic models.

Chapter 2: Which DNA Patterns Play the Role of Molecular Clocks?

Chapter 1 searched a single genome for a pattern that clusters in one place (the ori). This chapter addresses a different problem: finding a pattern that recurs (with variation) exactly once in each of many separate regions. The biological context is how cells keep time.

Biological Motivation: Circadian Clocks and Gene Regulation

Animals, plants, and bacteria have a circadian clock: an internal, roughly 24-hour cycle synchronized with day and night. The clock is implemented in DNA. A small set of circadian genes control it; in plants, three of them are called LHY, CCA1, and TOC1.

These genes encode transcription factors: master regulatory proteins that turn other genes on and off. A transcription factor binds to a short, specific stretch of DNA near the gene it controls. That stretch is called a regulatory motif (or transcription factor binding site). A daily cycle of gene activity is produced by transcription factors locating and binding their motifs.

The computational difficulty comes from one property: a regulatory motif is not an exact string. The same factor binds many slightly different sequences. For example, CCA1 binds both AAAAAATCT and AAGAACTCT, which differ at two positions. The target is therefore not a single fixed string but a family of similar strings. Identifying that family is the motif finding problem.

Motif Finding vs. Finding the Ori

Reusing frequent words with mismatches from Chapter 1 does not work well here, for two reasons: one concerns the structure of the signal, the other its cost.

Different signal structure. A DnaA box and a regulatory motif occur in opposite patterns:

A DnaA box clumps: it appears many times within one short interval (the ori region of a single genome).
A regulatory motif is scattered: it appears just once (perhaps mutated) in each of many different regions - for instance, in the upstream region of every gene that the same factor regulates. No single region contains a clump to detect.

Higher cost. Frequent-words-with-mismatches enumerates the entire neighborhood of every k-mer, and that neighborhood grows rapidly with both the motif length $k$ and the number of allowed mismatches $d$. It does not scale to the long, heavily mutated motifs in this setting. A different approach is required.

The Implanted Motif Problem

We restate the biology as a string problem. The input is a collection of DNA strings (one per region). We look for a k-mer that occurs (with up to $d$ mutations) once in every string.

In code this collection is the list dna. The quantities used throughout the chapter are:

$k$: the length of the motif we are searching for.
$t$: the number of strings in the collection.
$n$: the length of a single string.
$d$: the maximum number of mismatches allowed, where a fixed threshold is used.

A k-mer is a $(k, d)$-motif if it appears in every string of the collection with at most $d$ mismatches (Hamming distance $\le d$). The Implanted Motif Problem asks for all $(k, d)$-motifs in a collection.

Brute Force: Motif Enumeration

The brute-force solution is based on one observation: a $(k, d)$-motif must appear with $\le d$ mismatches in the first string too, so it is within Hamming distance $d$ of some k-mer of that string. Every $(k, d)$-motif therefore belongs to the combined $d$-neighborhood of the first string's k-mers. Generate that neighborhood as the candidate set, then keep only the candidates that also occur (with $\le d$ mismatches) in all of the strings.

def appears_within(text, pattern, d):
    """True if pattern occurs somewhere in text with at most d mismatches."""
    k = len(pattern)
    return any(
        hamming_distance(text[i:i + k], pattern) <= d
        for i in range(len(text) - k + 1)
    )

def motif_enumeration(dna, k, d):
    first = dna[0]
    motifs = set()
    for i in range(len(first) - k + 1):
        for candidate in neighbors(first[i:i + k], d):
            if all(appears_within(text, candidate, d) for text in dna):
                motifs.add(candidate)
    return motifs

This reuses hamming_distance and neighbors from Chapter 1. It is correct, but the neighborhood it builds grows rapidly with $k$ and $d$, the same scaling limitation seen in Chapter 1. The definition is also rigid: a $(k, d)$-motif must stay within $d$ mismatches in every string, so one region with a single extra mutation removes an otherwise valid motif from the results.

The Subtle Motif Problem

A standard benchmark illustrates this limitation. Implant one copy of a 15-mer into each of 10 randomly generated 600-nucleotide strings, and mutate 4 random positions in every copy. This is the Subtle Motif Problem.

The implanted 15-mer is "subtle" because four mutations out of fifteen is a high mutation rate: two implanted copies may differ at up to 8 positions, the motif is usually not the most frequent k-mer in the data, and motif_enumeration with $k = 15$, $d = 4$ generates a neighborhood far too large to enumerate. A usable method must handle this level of noise without relying on a fixed mismatch threshold.

Scoring a Set of Motifs

The key idea is to change the formulation. Instead of searching for a pattern, we choose one k-mer from each string and measure how similar the chosen k-mers are to one another. A good choice produces columns dominated by a single nucleotide; a poor choice produces columns close to uniform. Expressing this similarity as a single number turns motif finding into an optimization problem.

Stack the $t$ chosen k-mers as the rows of a $t \times k$ motif matrix, then read it column by column. Take this example with $t = 4$ and $k = 5$:

    A C G T A
    A C G C A
    A T G T A
    A C T T A

The count matrix has 4 rows (one per nucleotide) and $k$ columns (one per position); entry $(i, j)$ is how many times nucleotide $i$ appears in column $j$:

       1   2   3   4   5
   A   4   0   0   0   4
   C   0   3   0   1   0
   G   0   0   3   0   0
   T   0   1   1   3   0

The profile matrix divides each count by $t$, giving the frequency $P_{ij}$ of nucleotide $i$ in column $j$ (every column sums to 1):

        1      2      3      4      5
   A  1.00   0.00   0.00   0.00   1.00
   C  0.00   0.75   0.00   0.25   0.00
   G  0.00   0.00   0.75   0.00   0.00
   T  0.00   0.25   0.25   0.75   0.00

The consensus string takes the most frequent nucleotide in each column (here, ACGTA). It is the candidate regulatory motif for these regions.

Finally, the score measures disagreement: in each column, count the nucleotides that are not the most frequent one (the "unpopular" letters) and sum across columns. Columns 1 and 5 are uniform; columns 2, 3, and 4 each contain one differing letter, so the score is $0 + 1 + 1 + 1 + 0 = 3$. A fully conserved motif scores 0, and the more the chosen k-mers differ, the higher the score. The goal is to minimize it.

The four quantities translate directly into code:

NUCLEOTIDES = "ACGT"

def count_matrix(motifs):
    k = len(motifs[0])
    counts = {base: [0] * k for base in NUCLEOTIDES}
    for motif in motifs:
        for position, base in enumerate(motif):
            counts[base][position] += 1
    return counts

def profile_matrix(motifs):
    t = len(motifs)
    counts = count_matrix(motifs)
    return {base: [value / t for value in row] for base, row in counts.items()}

def consensus(motifs):
    counts = count_matrix(motifs)
    return ''.join(
        max(NUCLEOTIDES, key=lambda base: counts[base][position])
        for position in range(len(motifs[0]))
    )

def score(motifs):
    counts = count_matrix(motifs)
    t = len(motifs)
    return sum(
        t - max(counts[base][position] for base in NUCLEOTIDES)
        for position in range(len(motifs[0]))
    )

The Motif Finding Problem

Scoring lets us restate the goal without ever mentioning $d$:

Motif Finding Problem. Given a collection of DNA strings, choose one k-mer from each string so that the resulting motif matrix has the minimum possible score.

This formulation has no mismatch threshold $d$; it specifies a single quantity to minimize. Solving it by brute force is still expensive. Suppose each of the $t$ strings has length $n$. Each string offers $n - k + 1$ possible k-mers, and we pick one per string independently, giving $(n - k + 1)^t$ candidate motif matrices. Scoring one matrix costs about $k \cdot t$ steps, and since $k \ll n$ the total cost is $O(k \cdot t \cdot n^t)$. The $n^t$ term makes the cost exponential in the number of strings $t$, which is impractical.

from itertools import product

def brute_force_motif_search(dna, k):
    kmers_per_string = [
        [text[i:i + k] for i in range(len(text) - k + 1)]
        for text in dna
    ]
    best_motifs, best_score = None, float('inf')
    for motifs in product(*kmers_per_string):   # (n - k + 1) ** t combinations
        current_score = score(list(motifs))
        if current_score < best_score:
            best_motifs, best_score = list(motifs), current_score
    return best_motifs

We now have a well-defined objective but no efficient algorithm for it. Practical algorithms (greedy and randomized motif search) give up guaranteed optimality in exchange for efficiency on real data.

Scoring a k-mer with a Profile: the Most Probable k-mer

Both algorithms that follow rely on one new tool. So far a profile has been a way to summarize a chosen set of motifs. We now read it in the other direction, as a predictive model: given a profile, how well does an arbitrary k-mer fit it?

Treat each column of the profile as the probability of seeing each nucleotide at that position, and assume the positions are independent. The probability of a k-mer given a profile is then the product of the column frequencies of its letters. For a k-mer with letters $a_1 a_2 \cdots a_k$:

$\Pr(\text{k-mer} \mid \text{Profile}) = \prod_{j=1}^{k} \text{Profile}(a_j,\ j)$

Using the profile from the previous section, the consensus ACGTA scores $1.00 \times 0.75 \times 0.75 \times 0.75 \times 1.00 = 0.42$, the highest of any 5-mer. Any k-mer that disagrees with the consensus scores lower, because it must pick up a smaller frequency in at least one column.

The profile-most-probable k-mer in a string is the k-mer that maximizes this probability: the string's single best candidate for the motif the profile describes.

def probability(kmer, profile):
    prob = 1.0
    for position, base in enumerate(kmer):
        prob *= profile[base][position]
    return prob

def profile_most_probable_kmer(text, k, profile):
    best_kmer, best_prob = text[:k], -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        prob = probability(kmer, profile)
        if prob > best_prob:
            best_kmer, best_prob = kmer, prob
    return best_kmer

Ties are broken in favor of the first occurrence.

Greedy Motif Search

A greedy algorithm builds a solution one piece at a time, and at every step it commits to the choice that looks best at that moment, never going back to revise it. It is the paradigm behind algorithms such as Huffman coding and Dijkstra's shortest paths. It gives up the guarantee of reaching the global optimum in exchange for speed.

Greedy motif search builds the motif set one string at a time:

Pick a k-mer from the first string as a seed. It is the only row of a one-row motif matrix, and its profile is the first (overconfident) guess of what the motif looks like.
Move to the second string and select its profile-most-probable k-mer: the k-mer that best fits the current guess. Add it. The profile, now built from two k-mers, is a slightly better guess.
Continue down the list of strings. Each new string contributes the k-mer that best fits the profile built from all of the previously chosen k-mers.

After the last string we have one complete set of $t$ motifs. Its quality depends entirely on the seed, so we run the whole process once for every k-mer of the first string and keep the lowest-scoring result.

def greedy_motif_search(dna, k):
    best_motifs = [text[:k] for text in dna]
    first = dna[0]
    for i in range(len(first) - k + 1):
        motifs = [first[i:i + k]]
        for text in dna[1:]:
            profile = profile_matrix(motifs)
            motifs.append(profile_most_probable_kmer(text, k, profile))
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
    return best_motifs

This is polynomial, roughly $O(n^2 \cdot k \cdot t)$ for $t$ strings of length $n$: a decisive improvement over the exponential brute force. The weakness is inherent to greed: an early k-mer chosen from a profile based on very little evidence can steer every later choice the wrong way, with no mechanism to recover.

The Zero Problem and Pseudocounts

There is a flaw in scoring k-mers with a raw profile. When only a few motifs have been chosen, many profile entries are exactly $0$, for nucleotides that have not yet appeared in a column. Because the probability of a k-mer is a product, a single $0$ makes the entire value $0$.

Take the profile from before, whose consensus is ACGTA. The k-mer GCGTA differs from the consensus at only the first position, yet column 1 has frequency $0$ for G, so $\Pr(\texttt{GCGTA} \mid \text{Profile}) = 0$. A near-perfect match is declared impossible, and all the information from its four matching columns is discarded. This is too strict, and it happens precisely when the profile is built from few motifs, as in both algorithms here.

The fix is Laplace's rule of succession: add a pseudocount of $1$ to every entry of the count matrix before dividing. Each column gains four extra observations, so the frequencies use a denominator of $t + 4$. No entry is ever $0$, so a single unpopular letter lowers a k-mer's probability instead of forcing it to zero.

def profile_with_pseudocounts(motifs):
    t = len(motifs)
    counts = count_matrix(motifs)
    return {
        base: [(value + 1) / (t + 4) for value in row]
        for base, row in counts.items()
    }

Both greedy and randomized motif search use this pseudocounted profile in place of the raw one. With it, GCGTA above receives column-1 frequency $(0 + 1) / (4 + 4) = 0.125$ and a small but nonzero probability.

Randomized Motif Search

Randomized motif search rests on a circular relationship:

given the motifs, computing the profile is easy;
given the profile, finding the motifs is easy (take the profile-most-probable k-mer in each string).

The trouble is that we start with neither. The algorithm breaks the circle by guessing: choose one k-mer at random from each string, then alternate the two easy steps. Build a profile from the current motifs, use it to re-select the motifs, and repeat. Each pass replaces the motif set with one that fits its own profile better, so the score never rises; when a pass fails to improve it, the search has settled and stops.

import random

def motifs_from_profile(profile, dna, k):
    return [profile_most_probable_kmer(text, k, profile) for text in dna]

def random_motifs(dna, k):
    motifs = []
    for text in dna:
        start = random.randint(0, len(text) - k)
        motifs.append(text[start:start + k])
    return motifs

def randomized_motif_search(dna, k):
    motifs = random_motifs(dna, k)
    best_motifs = motifs
    while True:
        profile = profile_with_pseudocounts(motifs)
        motifs = motifs_from_profile(profile, dna, k)
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
        else:
            return best_motifs

Why would starting from random k-mers ever find a real motif? Random k-mers produce a profile that is nearly, but not perfectly, uniform. If even a couple of them happen to overlap the true motif, the profile leans slightly toward it, and that lean is enough for the re-selection step to pull in better-matching k-mers from the other strings, which makes the profile lean further. It is a positive feedback loop toward whatever signal the starting set happened to contain.

The catch is that a single run almost always locks onto noise and converges to a poor local optimum, because a random start usually contains no real signal. The remedy follows from the cost: one run is cheap, so run it thousands of times from independent random starts and keep the best result overall. With enough starts, at least one begins close enough to the true motif to converge to it. This shape, alternating two cheap steps from many random starts, is the same one behind clustering methods such as k-means.

def best_over_many_runs(dna, k, runs=1000):
    best_motifs = randomized_motif_search(dna, k)
    for _ in range(runs - 1):
        candidate = randomized_motif_search(dna, k)
        if score(candidate) < score(best_motifs):
            best_motifs = candidate
    return best_motifs