
I am an Enterprise Architect at HCL Technologies, a $7 billion IT services organization. My role is to work as a technology partner for large enterprise customers, providing them with low-cost open source solutions around Java, Spring and the vFabric stack. I also work on various projects involving cloud-based solutions, mobile applications and business analytics around the Spring and vFabric space. Over 23 years, I have built a repository of technologies and tools I liked and used extensively in my day-to-day work. In this blog, I am sharing these best practices and tools so that they help the people who visit my website. Krishna is a DZone MVB and is not an employee of DZone and has posted 64 posts at DZone.

Jsoup: A Nice Way to do HTML Parsing in Java

12.10.2012

HTML parsing in Java comes up for various reasons, such as JUnit testing and web crawling. I stumbled across jsoup and tried a few things to understand its capabilities. If you do some googling, you will come across a few good discussions on Stack Overflow, such as “What is a good java web crawler library?” and “JSoup vs HttpUnit”.

I had already worked extensively with HttpUnit, and I feel that jsoup is better. Let me demonstrate a few of jsoup's capabilities in this post.

Connecting to a website and parsing its content into a DOM tree is as simple as:

URL url = new URL(
"http://gosmarter.net?query=cars");
Document doc = Jsoup.parse(url, 3000);

The integer passed to the parse method is the timeout in milliseconds. If downloading the page takes longer than that, the call gives up and throws an IOException.

To retrieve a table or a div from the DOM tree, do the following:

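Iterator<Element> productList = doc.select(
"div[class=productList]").iterator();
// Fail fast if the page has no product list at all
assertTrue(productList.hasNext());
while (productList.hasNext()) {
// Advance the iterator; without next() the loop never ends
Element product = productList.next();
//Do some processing
}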
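
To try the selection step without hitting a live site, the same query can be run against an inline HTML string. The following is only a sketch: the markup, the class name ProductListDemo and the helper productNames are made up for illustration, and it assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ProductListDemo {

    // Hypothetical markup standing in for a downloaded page.
    static final String HTML =
        "<html><body>"
        + "<div class=\"productList\"><ul><li><a href=\"/cars/1\">Sedan</a></li></ul></div>"
        + "<div class=\"productList\"><ul><li><a href=\"/cars/2\">Hatchback</a></li></ul></div>"
        + "<div class=\"other\"><a href=\"/about\">About</a></div>"
        + "</body></html>";

    // Collects the link text of every div whose class attribute is "productList".
    static List<String> productNames(String html) {
        Document doc = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        // select() returns an Elements collection, so a for-each loop works too.
        for (Element product : doc.select("div[class=productList]")) {
            Element link = product.select("a").first();
            if (link != null) { // first() returns null when nothing matched
                names.add(link.text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(productNames(HTML));
    }
}
```

The for-each loop is an alternative to the explicit iterator above; it avoids forgetting the call to next().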

To extract a link URL, do this:

Element productLink = product.select("a").first();
String href = productLink.attr("abs:href");

Note that in the above code, “abs:href” returns the absolute URL even if the path in the page is relative; the path is resolved against the document's base URI. The Element class is a jsoup class. Its select method queries the element's descendants using jsoup's CSS-like selector syntax, and its attr method retrieves a specific attribute of the element; in this example we retrieve the href attribute of an “a” link tag. The first method always returns the first matching element, which is handy when there are many “td”, “tr” or “li” tags, and it returns null when nothing matches.

You can also pick out a specific “td”, “tr” or “li” element by its index, as below:
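
The effect of resolving a relative path against a base address can be pictured with the JDK alone. The sketch below uses java.net.URI.resolve as a stand-in for that resolution rule; it is not jsoup's own implementation, and the class name AbsoluteUrlSketch and the example.com addresses are invented for illustration.

```java
import java.net.URI;

public class AbsoluteUrlSketch {

    // Resolves href against baseUri; an href that is already absolute comes back unchanged.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/shop/list.html"; // hypothetical page address
        System.out.println(resolve(base, "item/1.html"));            // relative to the page's folder
        System.out.println(resolve(base, "/about"));                 // relative to the site root
        System.out.println(resolve(base, "http://other.example/x")); // already absolute
    }
}
```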

Element descLi = product.select( "li:eq(0)").first();

Note that the select query above requests the first “li”, that is, the one at index 0 among its siblings. The syntax is “li:eq(0)”.

You can retrieve the text within a tag. For example, to retrieve the text in an “a” link tag, do as below:

Element descA = product.select( "a").first();
String desc = descA.text();

Note that the text method is used to retrieve the text.

Finally, to retrieve the HTML content inside an element, do as below:

Element descA = product.select( "a").first();
String descHtmlData = descA.html();

Note that the html method returns the inner HTML of an element. This is useful for debugging.

jsoup is also available as a jar from the Maven Central repository, using the dependency below:
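
The difference between text and html is easiest to see on an element that contains nested markup. Here is a small sketch, assuming the jsoup jar is on the classpath; the HTML fragment and the class name TextVsHtmlDemo are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class TextVsHtmlDemo {

    // Hypothetical fragment; the nested <b> tag makes the difference visible.
    static final String HTML =
        "<ul><li><a href=\"/cars/1\">Red <b>Sedan</b></a></li></ul>";

    // Parses the fragment and returns the first "a" element in it.
    static Element firstLink() {
        return Jsoup.parse(HTML).select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink();
        System.out.println(link.text());       // text only, markup stripped
        System.out.println(link.html());       // inner HTML, keeps the <b> tag
        System.out.println(link.attr("href")); // raw attribute value as written in the page
    }
}
```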

<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.7.1</version>
</dependency>

I hope this post helped you.



Published at DZone with permission of Krishna Prasad, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)