TextRank is a graph-based algorithm for text summarization, built on PageRank. In TextRank, the vertices of the graph are sentences, and the weight of the edge between two sentences denotes their similarity.
Using the following steps, we can extract important sentences from a set of documents:
- Sentence identification: split the documents into sentences
- Tokenization: split each sentence into a set of words
- Similarity calculation: calculate the similarity between each pair of sentences
- Build sentence graph: build a graph with sentences as vertices and similarities as edge weights
- TextRank: score the sentences via PageRank
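The five steps above can be sketched end to end in plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the regex splitter is a crude stand-in for a real sentence tokenizer, the similarity is the word-overlap measure from the original TextRank paper, and the ranking uses the normalized PageRank form with an assumed damping factor of 0.85.

```python
import math
import re


def split_sentences(text):
    # Naive stand-in for a real sentence tokenizer: break after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased set of alphanumeric words
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    # Shared words, normalized by the log of both sentence lengths
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def build_graph(sentences):
    # Dense weight matrix; no self-loops
    words = [tokenize(s) for s in sentences]
    n = len(sentences)
    return [[similarity(words[i], words[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]


def pagerank(weights, damping=0.85, iterations=50):
    # Weighted PageRank by power iteration (fixed iteration count, assumed)
    n = len(weights)
    if n == 0:
        return []
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(text, k=2):
    # Return the k highest-scoring sentences in their original order
    sentences = split_sentences(text)
    scores = pagerank(build_graph(sentences))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


text = ("Cats chase mice in the barn. Mice fear cats in the barn. "
        "Stock prices fell sharply today.")
print(summarize(text, k=2))
```

A sentence that shares no words with any other receives only the baseline score, so it ranks last, which is the behavior we want from a summarizer that favors sentences central to the document.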
We can use nltk’s included Punkt module to get sentences from a document.
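As a rough sketch of the sentence-identification step, the snippet below uses `PunktSentenceTokenizer` from `nltk.tokenize.punkt`. It assumes nltk is installed and deliberately uses an untrained tokenizer with default parameters, so no model download is needed; the sample text and the `get_sentences` helper are made up for illustration. In practice, `nltk.sent_tokenize` with the pretrained English model is the more common route, but it requires downloading the Punkt data first (the resource name depends on the nltk version).

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer


def get_sentences(document):
    # Untrained Punkt tokenizer: knows no abbreviations, so every
    # period followed by whitespace is treated as a sentence boundary.
    tokenizer = PunktSentenceTokenizer()
    return tokenizer.tokenize(document)


document = ("TextRank builds a graph. Each vertex is a sentence. "
            "Edges carry similarity.")
for sentence in get_sentences(document):
    print(sentence)
```

Because the untrained tokenizer has no abbreviation list, text containing forms such as "Dr." or "e.g." will be split too eagerly; the pretrained model handles those cases better.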