Python-based html5lib in Firefox's new HTML5 Parser


Mitchell is a DZone employee and has posted 1652 posts at DZone.



Last month, Firefox started shipping with a new HTML5 parser by default. Although it might not seem like much, it's another milestone in the journey of HTML5, since it replaces Gecko's old HTML parser. The transition was seamless, and the new parser handles documents according to the detailed rules of the HTML5 spec. The effort started with html5lib, a Python implementation of the WHATWG HTML5 spec (essentially a tokenizer, a parser, and a serializer), which was developed by James Graham and Anne van Kesteren.

html5lib, a Python library for parsing HTML, started in 2006 and has gone from version 0.1 to 0.9. Unlike some C libraries, it focuses less on performance and more on correctly handling the wide variety of HTML found on the web. There is also a PHP implementation, as well as a Ruby port that hasn't been maintained for a while. The user documentation elaborates on the details.

In version 0.9 html5lib gained the following features:

  • Parses valid and invalid HTML documents to a tree
  • Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup (deprecated) and custom simpletree output formats
  • DOM to SAX converter
  • Reports parse errors
  • Character encoding detection
  • Filtering and serializing of trees
  • HTML+CSS sanitizer
  • Many unit tests
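To make the first few items in the list concrete, here is a minimal sketch of parsing malformed markup into an ElementTree and inspecting the reported parse errors. It assumes a recent html5lib 1.x release, whose API may differ from the 0.9 version discussed here; the sample markup and variable names are purely illustrative.

```python
import html5lib

# Deliberately malformed markup: no doctype, and neither <p> nor <b> is closed.
broken = "<p>Hello <b>world<p>Second paragraph"

# strict=False (the default) makes the parser recover from errors
# instead of raising an exception on the first one.
parser = html5lib.HTMLParser(
    tree=html5lib.getTreeBuilder("etree"),
    strict=False,
    namespaceHTMLElements=False,  # plain tag names such as "p" (illustrative choice)
)
document = parser.parse(broken)

# The parser still builds a complete tree containing both paragraphs.
paragraphs = document.findall(".//p")
print(len(paragraphs))

# Each recorded error is a (position, error code, details) tuple.
for position, code, details in parser.errors:
    print(position, code)
```

Passing `strict=True` instead would raise on the first parse error, which can be handy when validating input rather than recovering from it.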

(Users of the sanitizer must make sure they serialize with quoted attribute values to avoid some known script injection holes in older browsers, including IE < 8.)
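As a rough illustration of the quoting advice, the sketch below serializes a small fragment with attribute quoting forced on. It assumes html5lib 1.x, where the serializer option `quote_attr_values` accepts the string `"always"`; older releases used a different form for this option. The sanitizer itself is not shown, and the sample fragment is made up for the example.

```python
import html5lib

# A fragment whose attribute value was written without quotes.
fragment = html5lib.parseFragment("<span title=hi>text</span>")

# Force every attribute value to be quoted on output.
# (Assumption: html5lib 1.x, where this option takes "always".)
quoted = html5lib.serialize(fragment, tree="etree", quote_attr_values="always")
print(quoted)
```

Without the option, the serializer may leave simple attribute values unquoted, which is the situation the note above warns about.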

The best part of the library is its test suite for parsing HTML according to the HTML5 spec. The same test suite was used in developing the Validator.nu code that drives the new HTML parser in the Gecko rendering engine.

Maybe you'll find an application or build one that uses the html5lib library.  The project is open to new contributors and the code is available under the MIT license.

