Resources for article extraction from HTML pages
Here are some good resources for learning how to extract articles from HTML pages.
Research papers and articles on article extraction from HTML pages
- Boilerplate Detection using Shallow Text Features
- Extracting Article Text from the Web with Maximum Subsequence Segmentation
- Text Extraction from the Web via Text-to-Tag Ratio
- Web Content Extraction Through Histogram Clustering (another version)
- VIPS: a Vision-based Page Segmentation Algorithm
- Automatic Web News Extraction Using Tree Edit Distance: This algorithm uses a tree comparison metric analogous to Levenshtein distance to detect relevant content in a set of HTML documents.
- Discovering Informative Content Blocks from Web Documents: employs entropy as a threshold metric to predict informative blocks of content.
- Web Page Cleaning with Conditional Random Fields: presents the best-performing algorithm, which uses a CRF to label blocks of content as text or noise based on block-level features.
- Hierarchical wrapper induction for semistructured information sources
- Template detection for large scale search engines
- Web Page Cleaning for Web Mining through Feature Weighting
- Eliminating noisy information in Web pages for data mining
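To make the entropy idea from the informative-content-blocks paper above more concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes each page has already been split into blocks of terms, and the function names and the 0.5 threshold are made up for illustration. Terms that are spread evenly across all pages of a site (navigation, footers) get an entropy near 1, while terms concentrated on a single page get an entropy near 0, so blocks with a low average entropy are treated as informative.

```python
import math
from collections import defaultdict


def term_entropies(pages):
    """Normalized entropy (0..1) of each term's distribution across pages.

    `pages` is a list of pages, each page a list of blocks, and each
    block a list of terms (a hypothetical input format for this sketch).
    """
    n_pages = len(pages)
    if n_pages < 2:
        raise ValueError("need at least two pages to compare")

    # counts[term][i] = number of occurrences of term on page i
    counts = defaultdict(lambda: [0] * n_pages)
    for i, page in enumerate(pages):
        for block in page:
            for term in block:
                counts[term][i] += 1

    entropies = {}
    for term, per_page in counts.items():
        total = sum(per_page)
        h = 0.0
        for c in per_page:
            if c:
                p = c / total
                # log base n_pages keeps the entropy in [0, 1]
                h -= p * math.log(p, n_pages)
        entropies[term] = h
    return entropies


def informative_blocks(pages, threshold=0.5):
    """Return (page_index, block_index) pairs whose mean term entropy
    is at or below the (illustrative) threshold."""
    entropies = term_entropies(pages)
    result = []
    for i, page in enumerate(pages):
        for j, block in enumerate(page):
            if not block:
                continue  # nothing to score in an empty block
            mean_h = sum(entropies[t] for t in block) / len(block)
            if mean_h <= threshold:
                result.append((i, j))
    return result
```

A real system would also need a block segmentation step (for example, splitting on table or div elements) and a tuned threshold; both are left out here.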
Some good blog articles:
The Easy Way to Extract Useful Text from Arbitrary HTML. The author uses examples written in Python to implement a technique fairly similar to the one described in the text-to-tag ratio paper listed above. The original link is dead; here is a copy: http://www.cnblogs.com/loveyakamoz/archive/2011/08/18/2143965.html
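For readers who want a feel for the text-to-tag ratio idea before reading the paper or the blog post, here is a rough sketch in Python. It is neither the paper's nor the blog author's code. It assumes the HTML is reasonably line-oriented, it does not strip script or style contents, and it replaces the paper's clustering step with a simple cut-off at the standard deviation of the smoothed ratios. All function names are made up for this example.

```python
import re
from statistics import pstdev

# Naive tag matcher; good enough for a sketch, not a real HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html):
    """Per line: number of non-tag characters divided by number of tags.

    Lines without any tag get their plain text length as the ratio.
    """
    ratios = []
    for line in html.splitlines():
        tags = len(TAG_RE.findall(line))
        text_chars = len(TAG_RE.sub("", line).strip())
        ratios.append(text_chars / tags if tags else float(text_chars))
    return ratios


def smooth(values, radius=1):
    """Moving average over a window of +/- `radius` lines."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def extract_content(html, radius=1):
    """Keep the text of lines whose smoothed ratio exceeds the
    standard deviation of all smoothed ratios (an illustrative cut-off)."""
    lines = html.splitlines()
    ratios = smooth(text_to_tag_ratios(html), radius)
    if not ratios:
        return ""
    threshold = pstdev(ratios)
    kept = [
        TAG_RE.sub("", line).strip()
        for line, ratio in zip(lines, ratios)
        if ratio > threshold
    ]
    return "\n".join(text for text in kept if text)
```

Smoothing helps on real pages, where a short line inside an article is surrounded by long ones; on very small inputs it can blur content and boilerplate together, so `radius=0` (no smoothing) may work better there.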
Software for article extraction from HTML pages
- Boilerpipe library: an open source Java library. It is the official implementation of the algorithm presented in the paper by Kohlschütter et al. listed above (Boilerplate Detection using Shallow Text Features).
- Project Goose by Gravity labs
- Perl module HTML::Feature
- Webstemmer is a web crawler and page layout analyzer with a text extraction utility
- Demo of VIPS packaged in a .dll (its use is limited to research purposes only)
Additional Boilerpipe resources:
- Code is here:
- It has also been integrated into Apache Tika
- Demo web service:
- Research presentation (WSDM 2010):