How to get XHTML file to object in java and how to work with it?

Question

I got an XHTML file (.hocr) from Tesseract 3.03 on Ubuntu 14.04 LTS. How can I load the data from this file into an object in Java? Or how else can I work with it? Unfortunately, I'm inexperienced with XML files, so any help would be much appreciated.

Example of the file:


  
    
      
        NAME
        FIRSTNAME

The unique identifier should be "word_1_X", where X stands for a number.

The point is to get NAME and FIRSTNAME and their positions in the picture. For example:

word_1_1 has X1=250 Y1=192, X2=1606 Y2=375, and the string value NAME.

Any ideas how to simply achieve this?

BalusC · Accepted Answer

You normally use an XML parser to parse XML files.

But since this appears to actually be an HTML file (most likely just the HTML output produced by an XHTML file as part of a JSF web application), you'd be better off using an HTML parser.

There are many HTML parsers; one well suited to parsing real-world HTML files and extracting specific data is Jsoup.

Provided that the HTML output is available at the URL http://example.com/some/page.jsf, here's how you could use Jsoup to extract the data of interest:

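import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch and parse the page (throws IOException on network failure).
Document document = Jsoup.connect("http://example.com/some/page.jsf").get();

// Each recognized word is an element with the CSS class "ocrx_word".
for (Element ocrxWord : document.select(".ocrx_word")) {
    String text = ocrxWord.text(); // NAME, FIRSTNAME, etc
    String title = ocrxWord.attr("title"); // bbox 250 192 1606 375; x_wconf 70, etc
    // ...
}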

Once you have the title, it is just a matter of using basic java.lang.String methods to break it down further into smaller parts. That responsibility is beyond the scope of the HTML parser; any Java beginner should be able to figure it out on their own.
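For reference, here is a minimal sketch of that last step using only the standard library. It assumes the title looks like the one in the comment above ("bbox x1 y1 x2 y2", optionally followed by other properties such as x_wconf). The names HocrTitleParser, WordBox and parse are made up for illustration and are not part of Jsoup or Tesseract.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HocrTitleParser {

    /** Hypothetical value object holding one recognized word and its bounding box. */
    static final class WordBox {
        final String id;
        final String text;
        final int x1, y1, x2, y2;

        WordBox(String id, String text, int x1, int y1, int x2, int y2) {
            this.id = id;
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }

        @Override
        public String toString() {
            return id + " \"" + text + "\" (" + x1 + "," + y1 + ")-(" + x2 + "," + y2 + ")";
        }
    }

    // Assumed title layout: "bbox x1 y1 x2 y2", anywhere in the attribute value.
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    /** Builds a WordBox from the element id, its text and its title attribute. */
    static WordBox parse(String id, String text, String title) {
        Matcher m = BBOX.matcher(title);
        if (!m.find()) {
            throw new IllegalArgumentException("No bbox found in title: " + title);
        }
        return new WordBox(id, text,
                Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)));
    }

    public static void main(String[] args) {
        WordBox box = parse("word_1_1", "NAME", "bbox 250 192 1606 375; x_wconf 70");
        System.out.println(box);
    }
}
```

Inside the Jsoup loop you would call something like HocrTitleParser.parse(ocrxWord.id(), text, title) and collect the results in a List. A regex is used rather than splitting on spaces so that the order of the other properties in the title does not matter.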
