  • Document Summarization using TextRank Example

    TextRank is a graph-based ranking algorithm, derived from PageRank, that can be used for text summarization. In TextRank, the vertices of the graph are sentences, and the weight of the edge between two sentences denotes their similarity.

    Using the following steps, we can extract the most important sentences from a set of documents.

    1. Sentence identification: Split the documents into sentences
    2. Tokenization: Split each sentence into a set of words
    3. Similarity calculation: Calculate the similarity between each pair of sentences
    4. Build sentence graph: Build a graph with sentences as vertices and similarities as edge weights
    5. TextRank: Score the sentences by running PageRank on the graph
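
    The five steps above can be sketched end to end with only the Python standard library. This is a minimal illustration under stated assumptions, not a reference implementation: the regex sentence splitter stands in for a trained tokenizer such as Punkt, the similarity measure (shared words divided by the sum of the log sentence lengths) is one common choice that the text above does not prescribe, and all function names and the damping and iteration defaults are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Step 1: naive regex splitter (a stand-in for a trained tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Step 2: turn one sentence into a set of lowercase words."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Step 3: word overlap normalized by log lengths (assumed measure)."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) == 0 would make the denominator vanish
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def build_graph(token_sets):
    """Step 4: symmetric weight matrix with a zero diagonal."""
    n = len(token_sets)
    return [
        [similarity(token_sets[i], token_sets[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def textrank(weights, damping=0.85, iterations=50):
    """Step 5: weighted PageRank computed by simple power iteration."""
    n = len(weights)
    if n == 0:
        return []
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Isolated sentences (no outgoing weight) pass on no score,
        # so the scores need not sum to 1 in this simplified version.
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, top_k=2):
    """Return the top_k highest-scoring sentences in document order."""
    sentences = split_sentences(text)
    scores = textrank(build_graph([tokenize(s) for s in sentences]))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]
```

    Calling summarize(text, top_k=2) returns the two highest-scoring sentences in their original order. A sentence that shares no words with any other sentence receives only the baseline score and is ranked last.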

    Sentence identification

    We can use the Punkt sentence tokenizer included with NLTK to split a document into sentences.

    