Simplest way to extract information (parsing) from HTML in Java

Question

I've read a lot of questions on Stack Overflow about HTML parsing. I've learned that, when possible, we should avoid regex and use a parser instead. I know there are a lot of HTML/XML parsers, but I don't know how to use them properly.

Consider this HTML, parsed through jTidy. I've got a Document object that jTidy created from this code:

Now, I would like to map (in a Map :D) each filename to its class (success/fail). I can do it with DOM, but I would have to create a NodeList and then, for each Element, create another NodeList (a lot of memory, and boring). There are alternatives like SAX, Xerces, etc., but I don't know their advantages and disadvantages.

What is the simplest (and fastest) way to extract that information from the "jTidied" HTML above?

vacuum · Accepted Answer

First of all - you forgot to add

tag.

You can parse your code very easily with Jsoup.

Here is an example:

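    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // The statements below go inside a method that declares "throws IOException".
    // String html = " ...here goes your html code... ";
    // Document doc = Jsoup.parse(html);
    // Or from a file:
    File input = new File("com.htm");
    Document doc = Jsoup.parse(input, "UTF-8");
    Elements trs = doc.select("tr"); // select all "tr" elements in the document
    for (Element tr : trs) {
        // tr.attr("class") is the class string of the tr element;
        // tr.select("td").text() is the filename text held inside its td element
        System.out.println("The file class is: " + tr.attr("class")
                + " The filename is: " + tr.select("td").text());
    }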
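If you would rather keep the org.w3c.dom.Document that jTidy already gave you, the XPath API that ships with the JDK can probably get you the Map without nesting NodeList loops by hand. This is only a minimal sketch under some assumptions: the original markup is not shown, so I am guessing that each row looks like <tr class="success"><td>name</td></tr>, that element names are lowercase, and that the document is not namespace-aware. The class name RowClassMapper and the helper parse (a stand-in for the jTidy step) are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RowClassMapper {

    /** Maps the text of each row's first td to that row's class attribute. */
    public static Map<String, String> mapRows(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Every tr that has at least one td child (assumed row layout).
            NodeList rows = (NodeList) xpath.evaluate("//tr[td]", doc, XPathConstants.NODESET);
            Map<String, String> result = new LinkedHashMap<>(); // keeps document order
            for (int i = 0; i < rows.getLength(); i++) {
                Element tr = (Element) rows.item(i);
                // Text of the first td, with surrounding whitespace trimmed.
                String fileName = xpath.evaluate("normalize-space(td[1])", tr);
                result.put(fileName, tr.getAttribute("class"));
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath expression", e);
        }
    }

    /** Hypothetical stand-in for jTidy: parses well-formed markup into a DOM Document. */
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample rows; replace with the Document produced by jTidy.
        String xhtml = "<table>"
                + "<tr class=\"success\"><td>a.txt</td></tr>"
                + "<tr class=\"fail\"><td>b.txt</td></tr>"
                + "</table>";
        System.out.println(mapRows(parse(xhtml))); // prints {a.txt=success, b.txt=fail}
    }
}
```

With a real jTidy Document you would call mapRows(doc) directly and skip parse. If two rows share a filename, the later row overwrites the earlier one in the Map.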
