Document Summarization using TextRank Example


TextRank is an algorithm based upon PageRank for extractive text summarization. In TextRank, the vertices of the graph are sentences, and the weight of the edge between two sentences denotes their similarity.

Using the following steps, we can extract the most important sentences from a set of documents.

  1. Sentence identification: split the documents into sentences
  2. Tokenization: split each sentence into a set of words
  3. Similarity calculation: calculate the similarity between each pair of sentences
  4. Build sentence graph: build a graph with sentences as vertices and similarities as edge weights
  5. TextRank: score the sentences via PageRank

Sentence identification

We can use nltk’s included Punkt module to get sentences from a document.
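A minimal sketch of this step (the document string here is only a placeholder, and it assumes the Punkt model has already been downloaded):

```python
import nltk

# Placeholder input; in practice this would be the full document text.
# Assumes the Punkt model is available (nltk.download('punkt')).
document = ("TextRank is a graph-based ranking model for text processing. "
            "It can be used to find the most relevant sentences in a document.")

sentences = nltk.sent_tokenize(document)
print(sentences)
```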

Word tokenization

TextRank is based on the bag-of-words model. We can split a sentence into a list of words on whitespace and punctuation:
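A minimal sketch of such a tokenizer (the exact splitting rules here are an assumption, not the only reasonable choice):

```python
import re

def tokenize(sentence):
    # Lower-case the sentence and split on any run of non-alphanumeric characters
    return [word for word in re.split(r'\W+', sentence.lower()) if word]

print(tokenize("TextRank is a graph-based ranking model."))
# ['textrank', 'is', 'a', 'graph', 'based', 'ranking', 'model']
```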

We can also use scikit-learn to do text feature extraction. 


By applying this to all of the documents, we can generate a document-term matrix.
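A sketch of this step using scikit-learn's CountVectorizer (the sentences below are placeholders; in practice they come from the sentence-splitting step):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder sentences standing in for the output of sentence identification.
sentences = [
    "TextRank is a graph-based ranking model.",
    "The vertices of the graph are sentences.",
    "Edge weights denote the similarity between sentences.",
]

# Rows are sentences, columns are words, entries are raw term counts.
count_vectorizer = CountVectorizer()
bow_matrix = count_vectorizer.fit_transform(sentences)
print(bow_matrix.shape)  # (number of sentences, vocabulary size)
```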

Build a sentence graph

In this matrix, the rows are sentences and the columns are words. To transform the matrix into a graph, we can use scikit-learn's TfidfTransformer.

This will re-weight each word based upon its tf-idf.

Taking the dot product of the re-weighted matrix with its transpose yields a new matrix in which both the rows and columns are sentences. Each element is the cosine similarity between a pair of sentences: a score of 1 means the two sentences are identical, and a score of 0 means there is no overlap between them.
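Continuing from the bag-of-words matrix above, a sketch of this step (TfidfTransformer L2-normalizes each row by default, so the dot product gives cosine similarities):

```python
from sklearn.feature_extraction.text import TfidfTransformer

# Re-weight raw counts by tf-idf; rows are L2-normalized by default.
tfidf_matrix = TfidfTransformer().fit_transform(bow_matrix)

# Pairwise cosine similarities between sentences.
similarity_graph = tfidf_matrix @ tfidf_matrix.T
print(similarity_graph.toarray())
```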

PageRank

After building the graph, we can use the PageRank algorithm to calculate the importance of each node in the graph. We can use the pagerank method from NetworkX.
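A sketch, continuing from the similarity matrix above (note that older NetworkX releases name the constructor from_scipy_sparse_matrix rather than from_scipy_sparse_array):

```python
import networkx as nx

# Treat the similarity matrix as the adjacency matrix of a weighted graph.
nx_graph = nx.from_scipy_sparse_array(similarity_graph)

# PageRank scores: a dict mapping node index -> importance score.
scores = nx.pagerank(nx_graph)
```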

The output is a map from sentence indices to PageRank scores. Using the index, we can look up the original sentence text.

Now we sort the sentences based on their PageRank scores.
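One way to sketch this, pairing each score with its sentence and sorting highest first:

```python
# Pair each score with its sentence and sort with the highest score first.
ranked = sorted(((scores[i], sentence) for i, sentence in enumerate(sentences)),
                reverse=True)

print(ranked[0][1])  # the highest-ranked sentence
```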

The whole implementation is as follows:
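A compact sketch of the whole pipeline (the function name textrank and the default of five returned sentences are our choices, not fixed by the algorithm):

```python
import nltk
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def textrank(document, n=5):
    """Return the n highest-scoring sentences of a document."""
    # 1. Sentence identification
    sentences = nltk.sent_tokenize(document)

    # 2-3. Bag-of-words counts, re-weighted by tf-idf
    bow_matrix = CountVectorizer().fit_transform(sentences)
    tfidf_matrix = TfidfTransformer().fit_transform(bow_matrix)

    # 4. Sentence graph: pairwise cosine similarities as edge weights
    similarity_graph = tfidf_matrix @ tfidf_matrix.T
    nx_graph = nx.from_scipy_sparse_array(similarity_graph)

    # 5. Score sentences with PageRank and return the top n
    scores = nx.pagerank(nx_graph)
    ranked = sorted(((scores[i], s) for i, s in enumerate(sentences)),
                    reverse=True)
    return [s for _, s in ranked[:n]]
```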

Using TextRank, we can choose the most relevant sentences from a text. The same approach can also be adapted to identify keywords that are important to a text.
 
Reference: http://joshbohde.com/blog/document-summarization