fweigl

Reputation: 22008

Parsing XML with Jsoup

I get the following XML which represents a news article:

<content>
   Some text blalalala
   <h2>Small subtitle</h2>
   Some more text blbla
   <ul class="list">
      <li>List item 1</li>
      <li>List item 2</li>
   </ul>
   <br />
   Even more freakin text
</content>

I know the format isn't ideal but for now I have to take it.

The Article should look like:

I parse this XML with Jsoup. I can get the text within the <content> tag with doc.ownText(), but then I have no idea where the other stuff (the subtitle) is placed; I only get one big String.

Would it be better to use an event-based parser for this (I hate them :(), or is there a way to do something like doc.getTextUntilTagAppears("tagName")?

Edit: For clarification, I know how to get the elements under <content>. My problem is with getting the text within <content>, broken up every time it's interrupted by an element.

I learned that I can get all the text within <content> with .textNodes(). That works great, but then again I don't know where each text node belongs in my article (one at the top before the h2, the other one at the bottom).

Upvotes: 4

Views: 9688

Answers (2)

fweigl

Reputation: 22008

The mistake I made was going through the XML by Elements, which do not include TextNodes. When I go through it Node by Node, I can check whether each Node is an Element or a TextNode and treat it accordingly.
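
A minimal sketch of that node-by-node walk, assuming jsoup is on the classpath and the input is parsed with its XML parser. The class name ArticleWalker, the walk helper and the "TEXT:" labels are made up for illustration; only childNodes(), TextNode, Element and Parser.xmlParser() come from jsoup itself.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class ArticleWalker {

    // Hypothetical helper: lists the direct children of <content> in
    // document order, so each piece of text keeps its position relative
    // to the elements around it.
    static List<String> walk(String xml) {
        // Parse as XML so the input is not restructured as an HTML page.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.selectFirst("content");

        List<String> parts = new ArrayList<>();
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                // Loose text sitting directly inside <content>.
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    parts.add("TEXT: " + text);
                }
            } else if (node instanceof Element) {
                // A child element such as <h2>, <ul> or <br>.
                Element el = (Element) node;
                parts.add("<" + el.tagName() + ">: " + el.text());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>Some text<h2>Small subtitle</h2>Some more text"
                + "<ul class=\"list\"><li>List item 1</li><li>List item 2</li></ul>"
                + "<br />Even more text</content>";
        walk(xml).forEach(System.out::println);
    }
}
```

Because childNodes() returns text nodes and elements interleaved in source order, the first text lands before the h2 and the last one after the br, which is the ordering that textNodes() alone does not give you.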

Upvotes: 3

zEro

Reputation: 1263

Jsoup has a fantastic selector-based syntax. See here

If you want the subtitle

Document doc = Jsoup.parse(new File("path-to-your-xml"), "UTF-8"); // parse the file (java.io.File); Jsoup.parse(String) would treat its argument as markup, not as a path

You know that the subtitle is in the h2 element

Element subtitle = doc.select("h2").first();  // first h2 element that appears

And if you would like to have the list:

Elements listItems = doc.select("ul.list > li");
for (Element item : listItems)
    System.out.println(item.text());  // print the list's items one after another

Upvotes: 9
