Aho-Corasick Algorithm

Tags: , , , ,

Problem:

Given an input text and an a dictionary, find all occurrences of all words in the input text.
Example:

We can solve this problem using KMP algorithm, but it is not efficient.

The solution is Aho-Corasick automation (AC automaton), which is one of the most famous string pattern match algorithms. See https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm for an introduction.

By using AC automaton, we can solve the problem in O(n + m +  z), where n is the total length of the words in the dictionary, m is the length of the string T, z is the occurrences of the valid words in T.

To use AC automaton, there are two steps:

1. Build AC automaton
2. Traverse the string T to get all the valid words using the built AC automaton

1.  Build AC automaton
     1.1 build a Trie based on the dictionary
            Trie is a powerful data structure for string match. Let’s see an example:

Given a dictionary, test whether a string is a valid prefix of a word in a dictionary?
The answer is Trie
            
             The following figure shows what a Trie looks like based on a simple dictionary.

aho-corasick

    
 1.2. build fail pointer for the state in the Trie
 
    The dashed arrows denote the failed pointers
   
    How to understand this figure: Let’s look at state 5. Why does it point to 2?

   
aho-corasick2

    Let’s say we have a string “shers”, we start from state 0, then following char s, we go to 3,then following char h, we go to state 4, then following char e, we go to 5, but following r, we can not go anywhere from state 5.

Considering that the longest postfix of she if he, if he is also a prefix of another word in the dictionary, we can continue the search without loosing any possibility of matching words start with he, and we don’t need to start from h again in the string shers.

That’s why the state 5 points to state 2, as the string she can match any words start with he.  We call the pointer from state 5 to 2 as failed pointer, as state 5 has no match following the char r in our example. 

   Once we find that: from state 5, following r will result in nothing valid state (null), we will follow the fail point of state 5, which is state 2. Now we are standing on state 2, then following r, we go to state 8, then following s, we go to state 9, now we find hers is a valid word in the string shers.

 how to find all the failing pointers? The answer is BFS (Bread First Search)

1.  The fail pointer of root is always root itself.

2. The fail pointer of the direct children of root is also the root

3. We do BFS starting from each of the direct children of the root.

for current state cur, we do following operation:

 
 2. Traverse target string T using the automaton
 

I implemented the algorithm using python. Here is the code.

Reference:
http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf