Reputation: 33
I got an XHTML file (.hocr) from Tesseract 3.03 on Ubuntu 14.04 LTS. How can I load the data from this file into an object in Java? Or how else can I work with it? Unfortunately, I'm inexperienced with XML files, so any help would be much appreciated.
Example of the file:
<div class='ocr_page' id='page_1' title='image "test2jpg.jpg"; bbox 0 0 10000 10000; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 250 192 8637 686">
<p class='ocr_par' dir='ltr' id='par_1_1' title="bbox 250 192 8637 686">
<span class='ocr_line' id='line_1_1' title="bbox 250 192 8637 414; baseline 0 -40">
<span class='ocrx_word' id='word_1_1' title='bbox 250 192 1606 375; x_wconf 70' lang='eng' dir='ltr'>NAME</span>
<span class='ocrx_word' id='word_1_2' title='bbox 1676 192 3051 375; x_wconf 73' lang='eng' dir='ltr'><strong>FIRSTNAME</strong></span>
The unique identifier should be "word_1_X", where X stands for a number.
The point is to get NAME and FIRSTNAME and their positions in the picture. For example:
word_1_1 has X1=250 Y1=192
X2=1606 Y2=375
string value NAME.
Any ideas how to simply achieve this?
Upvotes: 1
Views: 2921
Reputation: 1108722
You normally use an XML parser to parse XML files.
But as it appears to actually be an HTML file (most likely just the HTML output produced by an XHTML file as part of a JSF web application), you'd better use an HTML parser.
There are many HTML parsers; one well suited to parsing real-world HTML files and extracting specific data is Jsoup.
Provided that the HTML output is available at the URL http://example.com/some/page.jsf, here's how you could use Jsoup to extract the data of interest:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document document = Jsoup.connect("http://example.com/some/page.jsf").get();
for (Element ocrxWord : document.select(".ocrx_word")) {
    String text = ocrxWord.text(); // NAME, FIRSTNAME, etc
    String title = ocrxWord.attr("title"); // bbox 250 192 1606 375; x_wconf 70, etc
    // ...
}
Once you have the title, it's just a matter of using basic java.lang.String methods to break it down further into smaller parts. That responsibility is beyond the scope of the HTML parser; any Java beginner should be able to figure it out on their own.
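For illustration, here is a minimal sketch of that last step, assuming the title always contains a "bbox x1 y1 x2 y2" group as in the sample file. The class name HocrBBox and its parse method are hypothetical names made up for this example, not part of Jsoup or Tesseract, and the sketch uses a regular expression rather than plain String splitting:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: holds the four bbox coordinates from an hOCR title attribute.
public class HocrBBox {
    // Matches "bbox x1 y1 x2 y2" anywhere inside the title string.
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    public final int x1, y1, x2, y2;

    private HocrBBox(int x1, int y1, int x2, int y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    /** Returns the bounding box found in the title, or null if there is none. */
    public static HocrBBox parse(String title) {
        Matcher m = BBOX.matcher(title);
        if (!m.find()) {
            return null;
        }
        return new HocrBBox(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)));
    }

    public static void main(String[] args) {
        // Title value taken from word_1_1 in the question's sample.
        HocrBBox box = parse("bbox 250 192 1606 375; x_wconf 70");
        System.out.println(box.x1 + " " + box.y1 + " " + box.x2 + " " + box.y2);
        // prints "250 192 1606 375"
    }
}
```

Inside the loop above you would call HocrBBox.parse(title) and store the result together with ocrxWord.id() and the text, for example in a map keyed by the word id.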
Upvotes: 1