Good resources for HTML extraction using Python

Here are some good Python tools for extracting text from HTML pages:

  1. http://www.clips.ua.ac.be/pages/patter

Pattern

Pattern is a web mining module for the Python programming language.

It has tools for data mining (Google, Twitter and Wikipedia APIs, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization.

pattern.web

The pattern.web module has tools for online data mining: asynchronous requests, a uniform API for web services (Google, Bing, Twitter, Facebook, Wikipedia, Wiktionary, Flickr, RSS), an HTML DOM parser, HTML tag stripping functions, a web crawler, webmail, caching, and Unicode support.

HTML to plaintext

The HTML source code of a web page can be retrieved with URL.download(). HTML is a markup language that uses tags to define text formatting. For example, <b>hello</b> displays hello in bold. For many tasks we may want to strip the formatting so we can analyze (e.g., parse or count) the plain text.

The plaintext() function removes HTML formatting from a string.

plaintext(html, keep=[], replace=blocks, linebreaks=2, indentation=False)

It performs the following steps to clean up the given string:

  • Strip JavaScript: remove all <script> elements.
  • Strip CSS: remove all <style> elements.
  • Strip comments: remove all <!-- --> elements.
  • Strip forms: remove all <form> elements.
  • Strip tags: remove all HTML tags.
  • Decode entities: replace &lt; with < (for example).
  • Collapse spaces: replace consecutive spaces with a single space.
  • Collapse linebreaks: replace consecutive linebreaks with a single linebreak.
  • Collapse tabs: replace consecutive tabs with a single space. Optionally, indentation (i.e., tabs at the start of a line) can be preserved.
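
To make the cleanup steps above concrete, here is a rough sketch that approximates them using only the Python standard library. It is not Pattern's implementation: the function name strip_html, its keep_indentation parameter, and the sample string are made up for illustration, and the regular expressions are a simplification that will not handle malformed HTML. Unlike plaintext(), it has no keep, replace, or linebreaks options.

```python
import html
import re


def strip_html(source, keep_indentation=False):
    """Approximate the listed cleanup steps (hypothetical helper, not Pattern's code)."""
    # Strip JavaScript, CSS, and forms: drop the elements and everything inside them.
    for tag in ("script", "style", "form"):
        pattern = r"(?is)<{0}\b.*?</{0}\s*>".format(tag)
        source = re.sub(pattern, "", source)
    # Strip comments.
    source = re.sub(r"(?s)<!--.*?-->", "", source)
    # Strip all remaining tags.
    source = re.sub(r"(?s)<[^>]+>", "", source)
    # Decode entities, e.g. &lt; becomes <.
    source = html.unescape(source)

    lines = []
    for line in source.split("\n"):
        # Collapse tabs: leading tabs are kept only when keep_indentation is set.
        body = line.lstrip("\t")
        indent = line[:len(line) - len(body)] if keep_indentation else ""
        line = indent + re.sub(r"\t+", " ", body)
        # Collapse spaces, then trim spaces (but not tabs) at the line edges.
        line = re.sub(r" {2,}", " ", line).strip(" ")
        lines.append(line)

    # Collapse linebreaks: runs of newlines become a single newline.
    text = re.sub(r"\n{2,}", "\n", "\n".join(lines))
    return text.strip("\n")


# Made-up sample input that exercises each step.
sample = (
    "<html><head><style>b {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><!-- note --><p>Hello   <b>world</b> &lt;3</p>\n\n\n"
    "<p>\tBye\t\tnow</p><form><input name='q'></form></body></html>"
)

print(strip_html(sample))
print(strip_html(sample, keep_indentation=True))
```

Tags are stripped before entities are decoded, in the same order as the list, so that an encoded &lt; in the text is not mistaken for the start of a tag.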

See the following examples: