
Dr. Axel Rauschmayer is a freelance software engineer, blogger and educator, located in Munich, Germany. Axel is a DZone MVB, is not an employee of DZone, and has posted 246 posts at DZone.

Transforming HTML with Node.js and jQuery

02.18.2012
The npm module jsdom enables you to use jQuery to examine and transform HTML on Node.js. This post explains how.

The Basics

As a tool for processing HTML, Node.js offers an important foundation: it can download or upload data, and it can read from or write to disk [1]. What it lacks is the ability to parse and transform HTML. Luckily, the jQuery library is ideally suited for this task. The jsdom module implements the HTML DOM on top of Node.js, which is everything that jQuery needs to run on that platform. To install it, use the Node package manager (npm):

npm install jsdom

jsdom is very easy to use:

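var fs = require("fs");

// call_jsdom is shown under "Caveats"; documentToSource is a helper not shown in this post
var htmlSource = fs.readFileSync("dummy.html", "utf8");
call_jsdom(htmlSource, function (window) {
    var $ = window.$;

    // Copy the document title into the empty heading
    var title = $("title").text();
    $("h1").text(title);

    console.log(documentToSource(window.document));
});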

Above, we first read the HTML source from disk into a string and then invoke jsdom with that source. jsdom calls us back with a window object when everything is finished. The function call_jsdom ensures that jQuery is already loaded “into” that window, so we only need to access window.$ and can work with jQuery as we would in a browser. The h1 heading of the document is still empty, so we read the title and put it into the h1 tag. Finally, we log the transformed HTML to the console. You can download the project jsdom_demo to try it out; run transform.js on the shell, either directly or via Node.js. The input is:

<!doctype html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>My document</title>
    </head>
    <body>
        <h1></h1>
    </body>
</html>

The output is:

<!doctype html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>My document</title>
    </head>
    <body>
        <h1>My document</h1>
    </body>
<script src="jquery-1.7.1.min.js"></script></html>
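
The helper documentToSource used in the first snippet is not shown in this post. The following is only a minimal sketch of what such a helper might look like, assuming the DOM implementation supports the standard properties doctype.name and documentElement.outerHTML; the actual helper in jsdom_demo may be implemented differently. The object fakeDocument is a hypothetical stand-in, so that the sketch runs without jsdom.

```javascript
// Sketch only: the real documentToSource from jsdom_demo may work differently.
function documentToSource(doc) {
    var doctype = "";
    if (doc.doctype) {
        // doctype.name is "html" for an HTML5 document
        doctype = "<!doctype " + doc.doctype.name + ">\n";
    }
    // outerHTML serializes the root element together with all of its children
    return doctype + doc.documentElement.outerHTML;
}

// Usage with a stand-in object instead of a real DOM document (hypothetical data)
var fakeDocument = {
    doctype: { name: "html" },
    documentElement: { outerHTML: "<html><head></head><body></body></html>" }
};
console.log(documentToSource(fakeDocument));
```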


Caveats

Keeping the structure of the source code. The original source code will be changed in several ways: Closing tags will be added (e.g. to close a <p> tag) and loading jQuery causes a script tag to be added (see output above). A possible work-around for transforming HTML (as opposed to extracting data) is to not work with a complete document. Instead, one can use $() to work with an HTML fragment that is separate from the document:

var fragment = $("<ul><li>item</li></ul>");

Seeing thrown exceptions. jsdom catches all exceptions, and unfortunately that includes exceptions thrown in its callbacks. The following function, which we called earlier, works around that:

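var jsdom = require("jsdom");

function call_jsdom(source, callback) {
    jsdom.env(
        source,
        [ 'jquery-1.7.1.min.js' ],  // (*) scripts to load into the window
        function(errors, window) {  // (**) exceptions thrown in here are swallowed
            // Defer the real work to the event loop, outside of jsdom's exception handling
            process.nextTick(
                function () {
                    if (errors) {
                        throw new Error("There were errors: "+errors);
                    }
                    callback(window);
                }
            );
        }
    );
}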

jsdom swallows all exceptions thrown inside the callback at (**), including in any functions that it calls. To escape that effect, you can use process.nextTick() to add a function to the event loop queue. It will be executed after the current code is finished.
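
The effect can be illustrated without jsdom. The following sketch assumes a hypothetical function swallowingLibrary that stands in for any library that wraps its callbacks in try/catch; it is not jsdom's actual implementation. It shows that an exception thrown directly in the callback disappears, while work scheduled via process.nextTick() runs only after the library call (and its try/catch) has finished.

```javascript
// Hypothetical stand-in for a library that swallows exceptions from its callbacks
function swallowingLibrary(callback) {
    try {
        callback();
    } catch (e) {
        // The exception is silently dropped here
    }
}

var log = [];

// 1. An exception thrown directly inside the callback never reaches the caller
swallowingLibrary(function () {
    log.push("throwing inside callback");
    throw new Error("Nobody will see this");
});
log.push("caller continues normally");

// 2. Work deferred via process.nextTick() runs after the try/catch has exited
swallowingLibrary(function () {
    process.nextTick(function () {
        // An exception thrown here would not be caught by swallowingLibrary
        log.push("deferred work");
        console.log(log.join(" -> "));
    });
    log.push("callback returned");
});
log.push("library call finished");
```

The deferred function is the last one to run, which is why an exception thrown there is reported by Node.js instead of being swallowed.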

Loading jQuery from a file. The examples in the jsdom readme load jQuery from a URL, causing internet traffic each time the code is run. A solution is to put a copy of jQuery next to the script and specify a file path instead of a URL, as seen above at (*).

Using jQuery multiple times. Do you have to invoke call_jsdom (or jsdom.env) every time you want to use jQuery? No, you can store window somewhere and use it again later. The initial startup is only callback-based to accommodate asynchronous script loading.
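
One way to store the window is a small caching wrapper. The sketch below is an assumption about how that could be organized, not code from jsdom_demo: createWindowCache is a hypothetical helper that takes a loader with the same signature as call_jsdom and invokes it only once. To keep the example runnable without jsdom, it is exercised with a stand-in loader called fakeLoader.

```javascript
// Hypothetical helper: wraps a loader such as call_jsdom so that the window is created only once
function createWindowCache(loadWindow, source) {
    var cachedWindow = null;
    var waiting = null; // callbacks queued while the first load is in progress

    return function withWindow(callback) {
        if (cachedWindow) {
            callback(cachedWindow);
            return;
        }
        if (waiting) {
            waiting.push(callback);
            return;
        }
        waiting = [callback];
        loadWindow(source, function (window) {
            cachedWindow = window;
            var queued = waiting;
            waiting = null;
            queued.forEach(function (cb) { cb(window); });
        });
    };
}

// Stand-in loader for demonstration; in a real script one would pass call_jsdom instead
var loadCount = 0;
function fakeLoader(source, callback) {
    loadCount++;
    callback({ source: source });
}

var withWindow = createWindowCache(fakeLoader, "<p>hi</p>");
var first, second;
withWindow(function (window) { first = window; });
withWindow(function (window) { second = window; });
console.log(loadCount, first === second);
```

Both calls receive the same window object, and the loader runs a single time.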

Conclusion: What is this good for?

When you have to parse or transform HTML, you realize just how good a transformation language jQuery is, all the more so because its documentation is well done and well suited to casual users. The solution described above is ideal for extracting information from HTML. Changing existing HTML requires more care, because the structure of the source code is not fully preserved (see Caveats).


Source: http://www.2ality.com/2012/02/jsdom.html
Published at DZone with permission of Axel Rauschmayer, author and DZone MVB.
