Longest Common Subsequence

Tags: ,

A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. It is a set of characters that appear in left-to-right order, but not necessarily consecutively.

Longest common subsequence (LCS) of 2 sequences is a subsequence, with maximal length, which is common to both the sequences.

Problem definition of Longest Common Subsequence

  1. Given the first sequence which contains (m) symbols X = (x1, x2, x3, …, xm)
  2. Given the second sequence which contains (n) symbols Y = (y1, y2, y3, …, yn)
  3. Find the longest common sequence (Z) between (X) and (Y) call it Z = (z1, z2, z3, …, zk)

See the following example:

The longest common subsequence of the two strings are AAAATTCA

Analysis

Let the input sequences be X[0..m-1] and Y[0..n-1] of lengths m and n respectively. And let L(X[0..m-1], Y[0..n-1]) be the length of LCS of the two sequences X and Y. We can recursive define L(X[0..m-1], Y[0..n-1]) as following:

  1. If the last characters of both sequences match (e.g. X[m-1] == Y[n-1]) then

    L(X[0..m-1], Y[0..n-1]) = 1 + L(X[0..m-2], Y[0..n-2])

  2. If the last characters of both sequences do not match (or X[m-1] != Y[n-1]) then

L(X[0..m-1], Y[0..n-1]) = MAX ( L(X[0..m-2], Y[0..n-1]), L(X[0..m-1], Y[0..n-2])

We can easily implement the recursive algorithm based on the above idea:

The above naive recursive algorithm has time complexity of O(2^n) in worst case when all characters of X and Y do not match. The reason for this high complexity is because there are many subproblems solved again and again. One way to improve is to use Memorization, which records the answers of the subproblems for reuse in the following calls thus avoiding repeated calculation. 

Another method is to use dynamic programming, which uses a table to solve the problem in a bottom up manner. 

We define a table C, where C[i, j] denotes the LCS (Longest Common Subsequence) of X[0 .. i – 1] and Y[0..j – 1]. 

Then we have the following rule to update the table for i from 1 to m and j from 1 to n given the input X[0 .. m-1] and Y[0 .. n – 1]. 

C[i, j] = 0 if i = 0 or j = 0  // this means X[0..i – 1]] is empty or Y[0..j – 1] is empty, C[i, j] = 0
C[i, j] = 1 + C[i-1, j-1] if i > 0 and j > 0 and X[i – 1] == Y[j – 1] 
C[i, j] = max (C[i, j-1], C[i-1, j]) if i > 0 and j > 0 and  X[i – 1]  != Y[j – 1] 

See the following code that is accepted by hackerrank:

You can submit and test your code at hackerrank using the following link:

https://www.hackerrank.com/challenges/dynamic-programming-classics-the-longest-common-subsequence

Here are some good resources to learn longest common subsequence problem:

http://www.8bitavenue.com/2011/11/dynamic-programming-longest-common-subsequence/

http://www.columbia.edu/~cs2035/courses/csor4231.F11/lcs.pdf

http://www.geeksforgeeks.org/dynamic-programming-set-4-longest-common-subsequence/

https://en.wikipedia.org/wiki/Longest_common_subsequence_problem