Apurv
Apurv

Reputation: 3753

Java library for HTML to Java (POJO) conversion

Using Apache Velocity Api we can combine Java objects (Lists, POJOs etc.) with a (HTML) template and create the (HTML) output.

Is there any Java API that can help reverse engineer this ? The input of this API could be HTML output and the template used, the output should be the data (in Java/XML format) that was used to generate the output.

I am aware of HTTP Unit API, but this just lets me extract HTML elements (like Tables). I am looking for something that extracts the data based on some template.

Upvotes: 0

Views: 4714

Answers (2)

Louis-wht
Louis-wht

Reputation: 555

My answer won't probably be useful to the writer of this question (I have 5 years late so not the right timing I guess) but as this is the first result I found on Google when typing HTML to POJO, I think it will probably be useful for many other developers that might come across this answer.

Today, I just released (in the name of my company) an HTML to POJO complete framework that you can use to map HTML to any POJO class with simply some annotations. The library itself is quite handy and features many other things all the while being very pluggable. You can have a look to it right here : https://github.com/whimtrip/jwht-htmltopojo

How to use : Basics

Imagine we need to parse the following html page :

<html>
    <head>
        <title>A Simple HTML Document</title>
    </head>
    <body>
        <div class="restaurant">
            <h1>A la bonne Franquette</h1>
            <p>French cuisine restaurant for gourmet of fellow french people</p>
            <div class="location">
                <p>in <span>London</span></p>
            </div>
            <p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>  
            <div class="meals">
                <div class="meal">
                    <p>Veal Cutlet</p>
                    <p rating-color="green">4.5/5 stars</p>
                    <p>Chef Mr. Frenchie</p>
                </div>

                <div class="meal">
                    <p>Ratatouille</p>
                    <p rating-color="orange">3.6/5 stars</p>
                    <p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
                </div>

            </div> 
        </div>    
    </body>
</html>

Let's create the POJOs we want to map it to :

public class Restaurant {

    @Selector( value = "div.restaurant > h1")
    private String name;

    @Selector( value = "div.restaurant > p:nth-child(2)")
    private String description;

    @Selector( value = "div.restaurant > div:nth-child(3) > p > span")    
    private String location;    

    @Selector( 
        value = "div.restaurant > p:nth-child(4)"
        format = "^Restaurant n\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        indexForRegexPattern = 1,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Long id;

    @Selector( 
        value = "div.restaurant > p:nth-child(4)"
        format = "^Restaurant n\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        // This time, we want the second regex group and not the first one anymore
        indexForRegexPattern = 2,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Integer rank;

    @Selector(value = ".meal")    
    private List<Meal> meals;

    // getters and setters

}

And now the Meal class as well :

public class Meal {

    @Selector(value = "p:nth-child(1)")
    private String name;

    @Selector(
        value = "p:nth-child(2)",
        format = "^([0-9.]+)\/5 stars$",
        indexForRegexPattern = 1
    )
    private Float stars;

    @Selector(
        value = "p:nth-child(2)",
        // rating-color custom attribute can be used as well
        attr = "rating-color"
    )
    private String ratingColor;

    @Selector(
        value = "p:nth-child(3)"
    )
    private String chefs;

    // getters and setters.
}

We provided some more explanations on the above code on our github page.

For the moment, let's see how to scrap this.

private static final String MY_HTML_FILE = "my-html-file.html";

public static void main(String[] args) {


    HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();

    HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);

    // If they were several restaurants in the same page, 
    // you would need to create a parent POJO containing
    // a list of Restaurants as shown with the meals here
    Restaurant restaurant = adapter.fromHtml(getHtmlBody());

    // That's it, do some magic now!

}


private static String getHtmlBody() throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
    return new String(encoded, Charset.forName("UTF-8"));

}

Another short example can be found here

Hope this will help someone out there!

Upvotes: 0

dinesh707
dinesh707

Reputation: 12592

You can use google protobuf in order to convert messages for different types. And it is very easy to define templates as well. I create JavaScript Objects using JSON.parse(), and in Java you can use protobuf to convert JSON to Java objects.

  1. http://code.google.com/p/protobuf/
  2. http://code.google.com/p/protobuf-java-format/

Upvotes: 1

Related Questions