Interview Questions: Edit distance

Tags: , , , ,

Edit distance related questions are frequently asked by many large companies such as Google, Facebook, Apple, Microsoft, Uber, Airbnb, Amazon, et al. In this post, we describe three kinds of problems related to edit distance.

What is edit distance?

Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances can be applied in natural language processing for automatic spell correction by selecting words from a dictionary that have a low distance to the word in question. 

Problem 1

The most basic problem related to edit distance is to implement the edit distance between two words. Here is the problem description from LeetCode:

Given two words word1 and word2, find the minimum number of steps required to convert word1 to word2. (each operation is counted as 1 step.)

You have the following 3 operations permitted on a word:

a) Insert a character
b) Delete a character
c) Replace a character

A typical solution to this problem is using Dynamic Programming. We use a table dis[][] to record the edit distance between the prefixes of the two words. 

e.g, dis[i][j] denotes the edit distance between word1[0:i] and word2[0:j], so the final answer is dis[word1.length][word2.length].

Initially,

dis[0][j] = j, this means the second word will remove j letters to be the same as the first one (e.g, empty)

dis[i][0] = i, this means the second word will insert i letters to be the same as the first one. 

Then we update the table using the following rules:

dist[i][j] = min ( dis[i-1][j] +1, dis[i][j-1] + 1, dist[i-1][j-1] + 1) if word1[i] != word2[j]

dis[i][j] = dis[i-1][j-1] if word1[i] == word2[j]

Please refer to this post for the implementation of edit distance in Java.

Problem 2

Given a dictionary and user’s input string w, find a word from the dictionary such the its edit distance with w is 1. 

Analysis

A native solution is to use the edit distance algorithm to calculate the edit distance between w and each of the words in the dictionary. We return the word if its edit distance to w is 1. The time complexity for an efficient edit distance algorithm is O(n^2).  Suppose the word w has length n, and there are m words in the dictionary, the time complexity will be O(m * n^2 ).  This is too slow. 

To deal with string pattern matching problems with a dictionary, a usual method is to build a Trie using the dictionary. However, we can not use Trie directly here, as the Trie is normally used to check whether the prefix of a word is in the dictionary. 

As edit distance equals 1 means there can be one mismatch when we search the word w in the Trie, we can still use Trie to solve this problem. The basic idea is to use a variable to record whether a mismatch occurs. If the current letter L does not match with the current Node n in the Trie, and a mismatch has not occurred, we can 1) remove the current letter L from word w, and set mismatched as true, then continue the search from the letter after L, 2) set mismatch as true, and from each children nc of n,   insert it before L, then continue the search from nc

when we search the word w in the Trie, we allow is 1 mismatch. 

However, this method can not be work when the edit distance is large. See Problem 3.

Problem 3

Given a dictionary and a word w, find all the words from the dictionary such that its edit distance to w is less than k

This problem is very important, as it is the fundamental to the automatic spell checking algorithm. 

We can not use the Trie based method now, as if we allow more than 1 mismatch, there can be the case that we add a letter and delete again. 

There are three methods to search a dictionary for the words with edit distance less than k. 

  1. The Naive approach

    The obvious way is to compute the edit distance from the query term to each words in a dictionary, then get the qualified terms for words suggestion.  We can improve this method by terminating the edit distance calculation when a threshold 2 or 3 has been reached.

  2. Peter Norvig Algorithm

    The basic idea of this algorithm is to generate all possible terms with an edit distance less than k to form a candidate set.  Then we can search those generated candidates from the dictionary.  The operations to generate all the terms with edit distance less than 2 include: delete x <= k letters, transposes  x<= k times, replaces x <= k  times, and insert x <= k times. 

    For a word length of length n, an letter size a, and an edit distance d = 1, there will be n deletions, n – 1 transpositions, a * n replacements, and a * (n + 1) insertions, For a total of 2n + 2an + a - 1 terms for search from the dictionary. 

    This is much better than method 1, but it still inefficient. For example, for n = 9, a = 36, d = 2, there will be 114, 324 terms. It is also language dependent as the letters set is language dependent. 

    Reference:http://norvig.com/spell-correct.html

  3. Symmetric Delete Spelling Correction (FAROO)

    The best algorithm so far is the FAROO algorithm.  The basic idea of this algorithm is as follows:

    1. Use delete only methods to generate all the terms with edit distance <= 2 from each term in the dictionary, and append them the dictionary. This is done only once as a pre-calculation step. 

    2. Use delete only methods to generate all the terms with edit distance <= 2 from the input term, then search them in the dictionary. 

    For a word with length n, letter size with a, and edit distance with 1, there will be just n deletions, for a total of n terms at search time. 

    Not only does this method faster, but also it is language independent, as no letters is used to do replacement. The only cost is the pre-calculation of the dictionary and the larger storage space because the number of deletes for each terms in the dictionary.  The number of deletes for a dictionary also depends on the maximums edit distance. 

    Some Remarks:

    1.  Different words in the dictionary might lead to same term after deletion. For example,  delete(sun,1) == delete(sin,1) == sn. While we generate only one new dictionary entry (sn),  both original terms need to be stored for spelling correction suggestion (sun,sin).

     
    2. There are four different comparison pair types:
    • dictionary term == input term,
    • delete(dictionary term, p1) == input term
    • dictionary term == delete(input term,p2)
    • delete(dictionary term,p1) == delete(input term,p2)
    The fouth comparison type is required for replaces and transposes only. But there can be false positives of higher edit distance. See: delete(bank, 2) == an

    delete(kanb, 2) == an, but bank != kanb. bank != xban, and bank != baxn. So it is necessary to check whether the suggest dictionary term is really a replace or an adjancent transpose of the input term.

    3. Instead of a static dictionary, the search engine index can be used as it is dynamic updated. The frequency of the search term can be used to filter out newly indexed terms. 

     
  4. Problem 4

    Given a dictionary, find all the paris of terms with edit distance == k. 
    We can use the FAROO algorithm solve this problem. For each word in the dictionary, we delete k/2 letters, then we check whether a pair of two words have the same variants. See delete(bank, 2) == an, delete(xank, 2) == an,  so we have dist(bank, xank) <= 4. In this way, we can reduce the search space dramatically. 

Reference:
http://baozitraining.org/blog/string-edit-distance/
http://blog.faroo.com/2012/06/07/ improved-edit-distance-based-spelling-correction/