Reputation: 22008
I get the following XML which represents a news article:
<content>
Some text blalalala
<h2>Small subtitle</h2>
Some more text blbla
<ul class="list">
<li>List item 1</li>
<li>List item 2</li>
</ul>
<br />
Even more freakin text
</content>
I know the format isn't ideal but for now I have to take it.
The Article should look like:
I parse this XML with Jsoup. I can get the text within the <content> tag with doc.ownText(), but then I have no idea where the other stuff (subtitle) is placed; I get only one big String.
Would it be better to use an event-based parser for this (I hate them :() or is there a possibility to do something like doc.getTextUntilTagAppears("tagName")?
Edit: For clarification, I know how to get the elements under <content>; my problem is with getting the text within <content>, broken up every time it's interrupted by an element.
I learned that I can get all the text within content with .textNodes(). That works great, but then again I don't know where each text node belongs in my article (one at the top before the h2, the other one at the bottom).
Upvotes: 4
Views: 9688
Reputation: 22008
The mistake I made was going through the XML by Elements, which do not include TextNodes. When I go through it Node by Node, I can check whether each Node is an Element or a TextNode, and that way I can treat them accordingly.
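The node-by-node walk described above could look roughly like the following. This is only a minimal sketch, not the exact code I ended up with: the class name ArticleWalker, the method walk, and the "TEXT:" labels are made up for illustration, and it assumes jsoup is on the classpath and that the input is parsed with jsoup's XML parser.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for the example.
class ArticleWalker {

    // Walks the direct children of <content> in document order, so each
    // piece of loose text keeps its position relative to the elements.
    static List<String> walk(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.select("content").first();
        List<String> parts = new ArrayList<>();
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                // Loose text between elements; skip whitespace-only nodes.
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    parts.add("TEXT: " + text);
                }
            } else if (node instanceof Element) {
                // A child element such as h2, ul or br.
                Element el = (Element) node;
                parts.add("<" + el.tagName() + ">: " + el.text());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>Some text<h2>Small subtitle</h2>Some more text"
                + "<ul class=\"list\"><li>List item 1</li><li>List item 2</li></ul>"
                + "<br />Even more text</content>";
        walk(xml).forEach(System.out::println);
    }
}
```

Because childNodes() returns text nodes and elements together in document order, the text before the h2 and the text after the list each come out at the right place in the article, which ownText() and textNodes() alone do not tell you.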
Upvotes: 3
Reputation: 1263
Jsoup has a fantastic selector-based syntax. See here
If you want the subtitle, first load the document:
Document doc = Jsoup.parse(new File("path-to-your-xml"), "UTF-8"); // parse the file into a Document (throws IOException)
You know that the subtitle is in the h2 element:
Element subtitle = doc.select("h2").first(); // first h2 element that appears
And if you would like to have the list:
Elements listItems = doc.select("ul.list > li");
for (Element item : listItems)
    System.out.println(item.text()); // print the list's items one after another
Upvotes: 9