I want to extract the text inside certain tags, such as <dt>, <dd>, etc., from HTML files using Apache Tika. To do this, I am writing a custom ContentHandler that is supposed to extract the information from these tags.
My custom ContentHandler code is shown below. It is not yet complete, but it is already not working as expected:
    public class TableContentHandler implements ContentHandler {

        // key = abbreviation
        // value = information / description for abbreviation
        private Map<String, String> abbreviations = new HashMap<String, String>();

        // current abbreviation
        private String abbreviation = null;

        // <dd> element contains abbreviation. So this boolean variable will be set when
        // <dd> element is found
        private boolean ddElementStarted = false;

        // this method is not giving contents within <dd> and </dd> tags
        public void characters(char[] chars, int arg1, int arg2) throws SAXException {
            if (ddElementStarted) {
                System.out.println("chars found...");
            }
        }

        // set boolean ddElementStarted to true to indicate that content handler found
        // <dd> element
        public void startElement(String arg0, String element, String arg2, Attributes arg3) throws SAXException {
            if (element.equalsIgnoreCase("dd")) {
                ddElementStarted = true;
            }
        }
    }
My assumption is that as soon as the content handler enters the startElement() method with the element name dd, I set ddElementStarted = true. Then, to get the contents between <dd> and </dd>, I check that flag in the characters() method.
In characters(), I check whether ddElementStarted is true, expecting the chars array to contain the contents between <dd> and </dd>, but it is not working :(
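For reference, the flag-and-buffer idea described above can be sketched with only the JDK's SAX classes, without Tika. This is a minimal sketch under several assumptions: the input is well-formed XHTML, each <dt> holds the abbreviation and the following <dd> holds its description, and the class name DefinitionListHandler and its parse() helper are hypothetical names introduced for illustration. It differs from the code in the question in that it buffers the text passed to characters() (which may be called several times per element) and resets the flag in endElement().

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects <dt>/<dd> pairs into a map.
public class DefinitionListHandler extends DefaultHandler {

    // key = abbreviation (<dt> text), value = description (<dd> text)
    private final Map<String, String> abbreviations = new LinkedHashMap<>();
    // text collected for the <dt> or <dd> element currently open
    private final StringBuilder text = new StringBuilder();
    private boolean capturing = false;
    private String currentTerm = null;

    // Falls back to the qualified name when the parser is not namespace-aware.
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = nameOf(localName, qName);
        if (name.equalsIgnoreCase("dt") || name.equalsIgnoreCase("dd")) {
            capturing = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element, so append rather than assign.
        if (capturing) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = nameOf(localName, qName);
        if (name.equalsIgnoreCase("dt")) {
            currentTerm = text.toString().trim();
            capturing = false;
        } else if (name.equalsIgnoreCase("dd")) {
            if (currentTerm != null) {
                abbreviations.put(currentTerm, text.toString().trim());
            }
            capturing = false;
        }
    }

    public Map<String, String> getAbbreviations() {
        return abbreviations;
    }

    // Convenience helper (hypothetical): parses a well-formed XHTML string.
    public static Map<String, String> parse(String xhtml) throws Exception {
        DefinitionListHandler handler = new DefinitionListHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getAbbreviations();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<dl><dt>API</dt><dd>Application Programming Interface</dd>"
                     + "<dt>SAX</dt><dd>Simple API for XML</dd></dl>";
        System.out.println(parse(xhtml));
    }
}
```

Since the handler is an ordinary SAX ContentHandler, the same class could presumably be passed to a Tika parser in place of the JDK parser used here, although that has not been verified in this sketch.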
I would also like to know whether XPath expressions can be used in Apache Tika. I could not find this information in the Tika in Action book.