dacmacho

Reputation: 45

Parsing complicated XML using Jsoup

I'm trying to parse an XML-formatted document with Jsoup, specifically the content of the paragraph tag in the example shown below.

...
<nitf:body.content>
     <p> Content would be here. </p>
</nitf:body.content>
...

There are multiple paragraph tags in the document, so I chose to use selector syntax to reach the body.content tag first and then the paragraph tag directly underneath it. My current attempt, which does not work, is:

// epochFileDoc is the name of the document with the code shown above.
Element tag_element = epochFileDoc.selectFirst("nitf|body.content > p");

I have tried a few different combinations of the selector syntax, including "nitf|content.body > p" and "nitf|body > p". None of them worked.

How would I use selector-syntax in Jsoup to get the paragraph tag shown above?

EDIT: I see why content.body does not work in the selector syntax: the dot starts a class selector, so it looks for a nitf:content element with the class "body". I'm still lost on how to get that element, though.

Upvotes: 1

Views: 92

Answers (2)

Hannes Erven

Reputation: 1125

@dacmacho's explanation is correct, and the workaround will do if you can modify the data before parsing it.

There is now a less invasive solution: I've just pushed a pull request ( https://github.com/jhy/jsoup/pull/1442 ) to Jsoup that enables backslash escapes within the selector for element names and CSS identifiers.

So with that change, you'd simply use the following (note the backslash right before the dot; it is doubled because it sits inside a Java string literal):

Element tag_element = epochFileDoc.selectFirst("nitf|body\\.content > p");

Upvotes: 1

dacmacho

Reputation: 45

A CSS selector, which is what Jsoup uses, cannot match this tag name as written because a dot has a special meaning in CSS: it introduces a class selector (as @Shlomi Fish said). In my code, I replaced instances of nitf:body.content with nitf:body-content using the line below, where file is the string that holds the XML:

file = file.replace("<nitf:body.", "<nitf:body-");

This allowed me to select the Element using:

Element tag_element = epochFileDoc.selectFirst("nitf|body-content > p");
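One detail worth noting: the replace call above only rewrites opening tags, because the closing tag starts with "</nitf:body." and so never contains the substring "<nitf:body.". Jsoup evidently tolerated that mismatch here, but if the rewritten string should stay well-formed XML, the closing tags can be renamed as well. A minimal sketch in plain Java; the class and method names are made up for illustration:

```java
// Hypothetical helper, not part of Jsoup: renames nitf:body.* tags to nitf:body-*
// so that the tag name no longer contains a dot.
final class NitfTagRewrite {

    static String renameBodyTags(String xml) {
        return xml
                // opening tags: <nitf:body.content> becomes <nitf:body-content>
                .replace("<nitf:body.", "<nitf:body-")
                // closing tags: </nitf:body.content> becomes </nitf:body-content>
                .replace("</nitf:body.", "</nitf:body-");
    }

    public static void main(String[] args) {
        String xml = "<nitf:body.content><p> Content would be here. </p></nitf:body.content>";
        System.out.println(renameBodyTags(xml));
    }
}
```

Like the original one-liner, this is a plain text substitution, so it would also rewrite the same character sequence if it happened to appear inside text content or a CDATA section.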

It would be smarter to use a different parser for XML documents in cases like this, but if you have requirements like mine or simply want to keep Jsoup, this workaround does the job.
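For completeness, here is a rough sketch of the "different parser" route using the DOM parser that ships with the JDK. It assumes the document is well-formed XML; the wrapper element in main, the class name, and the method name are invented for the example. The parser is left non-namespace-aware so that nitf:body.content is matched as a literal tag name and no namespace declaration is needed:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: finds the first <p> that is a direct child of the
// first <nitf:body.content> element, mirroring "nitf|body.content > p".
final class NitfDomExample {

    static String firstBodyParagraph(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Non-namespace-aware: "nitf:body.content" is treated as one plain name.
            factory.setNamespaceAware(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList bodies = doc.getElementsByTagName("nitf:body.content");
            if (bodies.getLength() == 0) {
                return null; // no such element in the document
            }
            // Walk direct children only, like the ">" child combinator.
            for (Node child = bodies.item(0).getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child instanceof Element && "p".equals(child.getNodeName())) {
                    return child.getTextContent().trim();
                }
            }
            return null; // element found, but it has no <p> child
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><p>Some other paragraph.</p>"
                + "<nitf:body.content><p> Content would be here. </p></nitf:body.content></doc>";
        System.out.println(firstBodyParagraph(xml));
    }
}
```

A dot is a legal character in an XML element name, so an XML parser needs no escaping or pre-processing here; the problem only arises because CSS selector syntax gives the dot a meaning of its own.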

Upvotes: 0
