Resources for article extraction from HTML pages

Here are some good resources for learning how to extract articles from HTML pages.

Research papers and articles for article extraction from HTML pages

Some good blog articles:

The Easy Way to Extract Useful Text from Arbitrary HTML. The author uses examples written in Python to apply a technique fairly similar to the one described in the text-to-tag ratio paper listed above. The original link is dead; here is a copy: http://www.cnblogs.com/loveyakamoz/archive/2011/08/18/2143965.html
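To make the text-to-tag ratio idea concrete, here is a minimal Python sketch. It is not the blog author's code and not the paper's full algorithm; it only scores each line of HTML by the number of visible text characters divided by the number of tags, and keeps lines above a cutoff. The function names, the sample page, and the threshold of 20 are all assumptions chosen for illustration.

```python
import re

# Matches any HTML tag, opening or closing (a rough approximation).
TAG_RE = re.compile(r"<[^>]*>")

def text_to_tag_ratios(html):
    """Return a list of (ratio, text) pairs, one per non-empty line."""
    # Remove scripts, styles, and comments so their contents are not
    # mistaken for article text.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    results = []
    for line in html.splitlines():
        if not line.strip():
            continue
        tag_count = len(TAG_RE.findall(line))
        text = TAG_RE.sub("", line).strip()
        # Lines without tags are divided by 1 to avoid division by zero.
        ratio = len(text) / max(tag_count, 1)
        results.append((ratio, text))
    return results

def extract_text(html, threshold=20.0):
    """Keep lines whose ratio meets the (hypothetical) threshold."""
    kept = [text for ratio, text in text_to_tag_ratios(html)
            if text and ratio >= threshold]
    return "\n".join(kept)

# Hypothetical sample page: navigation and widgets are tag-heavy,
# the article paragraph is text-heavy.
SAMPLE = """<html><body>
<div><a href="/">Home</a> <a href="/about">About</a></div>
<p>This is the main paragraph of the article, long enough to dominate its single pair of tags.</p>
<div><span>Share</span></div>
</body></html>"""

if __name__ == "__main__":
    print(extract_text(SAMPLE))
```

A fixed threshold is the crudest possible choice and depends on how the page's markup is split across lines; smoothing the ratios across neighbouring lines or choosing the cutoff from the data would likely be more robust.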

Software for article extraction from HTML pages

The boilerpipe code is here:

http://code.google.com/p/boilerp…
It has been integrated into Apache Tika as well.

Demo Web Service: http://boilerpipe-web.appspot.com/
Java library: http://code.google.com/p/boilerp…
Research presentation (WSDM 2010): http://videolectures.net/wsdm201…