Reputation: 91
for Example:
<div>
this is first
<div>
second
</div>
</div>
I am working on Natural Language Processing and I have to translate a website(not by using Google Translate) for which i have to extract both sentences "this is first" and "second" separately so that i can replace them with other language text in respective divs. If i extract text for first it will show "this is first second" and if I using recursion to dig deeper, it will only extract "second"
Help me out please!
EDIT
Using ownText() method will create problem in the following html code:
<div style="top:+0.2em; font-size:95%;">
the
<a href="/wiki/Free_content" title="Free content">
free
</a>
<a href="/wiki/Encyclopedia" title="Encyclopedia">
encyclopedia
</a>
that
<a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">
anyone can edit
</a>
.
</div>
It will print:
the that.
free
encyclopedia
anyone can edit
But it must be:
the
that
.
encyclopedia
anyone can edit
Upvotes: 1
Views: 1711
Reputation: 25350
If i extract text for first it will show "this is first second"
Use ownText()
instead of text()
and you'll get only the element contains directly.
Here's an example:
final String html = "<div>\n"
+ " this is first\n"
+ " <div>\n"
+ " second\n"
+ " </div>\n"
+ "</div>";
Document doc = Jsoup.parse(html); // Get your Document from somewhere
Element first = doc.select("div").first(); // Select 1st element - take the first found
String firstText = first.ownText(); // Get own text
Element second = doc.select("div > div").first(); // Same as above, but with 2nd div
String secondText = second.ownText();
System.out.println("1st: " + firstText);
System.out.println("2nd: " + secondText);
Upvotes: 2
Reputation: 2115
Elements divs=doc.getElementsByTag("div");
for (Element element : divs) {
System.out.println(element.text());
}
This should work if you are using jsoup HTML parser.
Upvotes: 0
Reputation: 1851
It seems like you're using textContent in the div's to extract the content, which will get you the content of that element, and all descendent elements. (Java: this would be the getTextContent method on the Element)
Instead examine the childNodes (Java: getChildNodes method on the Element). The nodes have a property "nodeType" (Java: getNodeType) which you can look at to work out whether the node is a Text Node (Java: Node.TEXT_NODE), or an Element (Java: Node.ELEMENT_NODE). So to take you example you have a tree of Nodes which look like this...
div (Element)
this is first (TextNode)
div (Element)
second (TextNode)
The outer div directly contains only two nodes - the first piece of text, and the inner div. That inner div then contains the text "second".
So loop over the nodes in the outer div, if the node is a text node, translate, otherwise recurse into the Element. Note that there are other kinds of nodes, Comments and the like, but for your purposes you can probably ignore those.
Assuming you're using the w3c DOM API http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html
Upvotes: 1
Reputation: 6039
You can use XML parser, in whatever language you are using. Here is for Java: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
Upvotes: 1