20 Longest Common Subsequence
This section considers another DP example, longest common subsequence (LCS).
Objectives. After learning this material, you should be able to:
- Execute the LCS algorithm on example inputs.
- Identify the DP components of the LCS algorithm.
- Solve new DP problems similar to LCS.
20.1 The problem and algorithm
In the longest common subsequence problem, our goal is to compare two sequences to find the longest subsequence that they have in common. Recall that a subsequence does not have to be consecutive. This problem is a part of, for example, version control software like git that needs to track changes to a document.
- Input: two sequences A and B (we will suppose the elements are characters or inetegers).
- Output: the length of the longest sequence that is a subsequence of both A and B.
Exercise 20.1 Let A = “ALGORITHM” and B = “ANARCHISM”. What is their longest common subsequence?
Solution.
One answer is ARIM, for a length of 4. Another is ARHM, also with length 4.
As always, the first question is the subproblem definition. A natural first try is a smaller version of the original problem. In that case, let C[i,j] = the length of the longest common subsequence of A[1:i] and B[1:j], where A[1:i] denotes the prefix of A from characters 1 to i and similarly for B[1:j].
In this case, to comptue the final answer, we just return C[n,m], where n = len(A) and m = len(B).
For the recurrence base case, if either input has zero characters, then the answer is zero, so C[0,j] = 0 for all j and C[i,0] = 0 for all i.
For the recurrence inductive case, suppose \(i,j \geq 1\). We need to consider cases. If A[i] == B[j], then one possibility for C[i,j] is a subsequence that ends with this character. The optimal length would be the length of a subsequence of A[1:i-1] and B[1:j-1], plus one more for this final character. This gives option a := 1 + C[i-1,j-1]. If A[i] != B[j], then we can set option a := 0.
Then, regardless of whether A[i] and B[j] are equal, we have two more options. If we do not include the last character of A in our subsequence, then our solution is the same as on A[1:i-1], which has value b := C[i-1,j]. And if we do not include the last character of B, then similarly the value is c := C[i,j-1]. (Note that if we do not include both, this will be covered by b and c.)
Putting these together, we can choose the best of these choices, C[i,j] = max{a, b, c}.
Combining all of the elements gives us this DP algorithm. Notice that we need to decide how to iterate through the subproblems. It’s important to make sure that when we’re solving subproblem (i,j), we’ve already solved all the subproblems it depends on: in this case, (i-1,j-1), (i-1,j), and (i,j-1).
Algorithm 20.1
Exercise 20.2 Execute the algorithm on input A = ALGORITHM, B = ANARCHISM. Fill in the two-dimensional table and give the final answer. Briefly explain how you filled in the squares where \(i \leq 2\) and \(j \leq 2\).
Solution.
| ’’ | A | L | G | O | R | I | T | H | M | |
| ’’ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| N | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| A | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| R | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 |
| C | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 |
| H | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 3 | 3 |
| I | 0 | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 3 | 3 |
| S | 0 | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 3 | 3 |
| M | 0 | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 3 | 4 |
The final answer is C[n,m] = 4.
To fill in those squares, first we used the base case to set C[i,j] = 0 if i == 0 or j == 0. Then for C[1,1], because the first characters are both A, they match and we use C[1,1] = 1 + C[0,0] = 1. For C[1,2], they don’t match, so we use the max of C[1,1] and C[0,2], which is 1. For C[2,1], they don’t match, so we use the max of C[1,1] and C[2,0], which is 1. For C[2,2], they don’t match, so we use the max of C[1,2] and C[2,1], which is 1.
Exercise 20.3 Explain how to modify the algorithm to reconstruct the longest subsequence itself. You do not need to write the entire code of a new algorithm, just describe how to do it. Briefly justify correctness.
Solution.
We make a second array D where D[i,j] tells us which choice we made when filling in C[i,j]. We can let D[i,j] = “a”, “b”, or “c” depending on which one achieved the max.
To reconstruct the subsequence afterward:
- Start at i = n, j = m. Let S be an empty sequence.
- If D[i,j] == “a”: If A[i] == B[j], then we add this character to the beginning of S. Regardless, we then let i -= 1, j -= 1.
- If D[i,j] == “b”, then we let i -= 1.
- If D[i,j] == “c”, then we let j -= 1.
- Repeat until i or j is zero, then stop and return S.
This is correct because: if D[i,j] == “a”, then the optimal longest common subsequence of A[1:i] and B[1:j] includes the current character A[i] == B[j], preceded by the optimal subsequence of A[1:i-1] and B[1:j-1]. So we add this character to S, and then recurse to the case (i-1,j-1). Similarly for the other options.