Reputation: 3614
I'm trying to extract content based on a given XPath. When I want to extract just one element, there is no issue. When a list of items matches the XPath, I get a NodeList and can extract the values.
However, there are a couple of items that are related to each other and form a group, and that group repeats itself.
One option is to get the NodeList of the parent nodes of all such groups and then apply a SAX-based parsing technique to extract the information. But this would introduce pattern-specific code, and I want to keep it generic. For example:
<html><body>
<!--... a lot divs and other tags ... -->
<div class="divclass">
<item>
<item_name>blah1</item_name>
<item_qty>1</item_qty>
<item_price>100</item_price>
</item>
</div>
<div class="divclass">
<item>
<item_name>blah2</item_name>
<item_qty>2</item_qty>
<item_price>200</item_price>
</item>
</div>
<div class="divclass">
<item>
<item_name>blah3</item_name>
<item_qty>3</item_qty>
<item_price>300</item_price>
</item>
</div>
</body></html>
I could easily write code for this particular XML, but not generic code that could handle any given specification.
I should be able to create a list of maps of attribute-value pairs from the above.
Has anyone tried this?
EDIT List of input xpaths:
1. "html:div[@class='divclass']/item/item_name"
2. "html:div[@class='divclass']/item/item_qty"
3. "html:div[@class='divclass']/item/item_price"
Expected output in simple text:
item_name:blah1;item_qty:1;item_price:100
item_name:blah2;item_qty:2;item_price:200
item_name:blah3;item_qty:3;item_price:300
The key point here is that if I apply each XPath separately, I get the results vertically: the first one fetches all the item_names, the second fetches all the item_qtys, and so on. So I lose the correlation between these pieces.
I hope this clarifies my requirements.
Thanks Nayn
Upvotes: 0
Views: 1154
Reputation: 5090
I think XQuery is a great solution for screen scraping. You can use the Saxon processor to execute your XQueries. You can also use the Piggy Bank Firefox extension to find the XPath expressions for the content you want to extract from a web page, and then use those expressions within your XQueries.
Upvotes: 2
Reputation: 1192
Why not apply XPath in two steps?
First an XPath(s) to get the records (the lines in your output):
//div[@class='divclass']/item
Then the XPath(s) to get the fields (the columns), relative to each record:
item_name
item_qty
item_price
Here's working code (JavaScript under Windows Script Host) that gives you the output you want:
var doc = new ActiveXObject("MSXML.DOMDocument");
doc.async = false; // load synchronously so the document is ready before querying
doc.load("test.xml");
// XPath #1: selects the records
var recordXPath = "//div[@class='divclass']/item";
// XPaths #2: a dictionary ("field name": "XPath"), relative to each record
var fieldXPaths = { item_name  : "item_name",
                    item_qty   : "item_qty",
                    item_price : "item_price" };
var items = doc.selectNodes(recordXPath);
for (var itemCtr = 0; itemCtr < items.length; itemCtr++) {
    var item = items[itemCtr];
    var fieldEntries = [];
    for (var fieldName in fieldXPaths) {
        var fieldXPath = fieldXPaths[fieldName];
        // Evaluate the field XPath with the current record as context node
        var fieldNode = item.selectSingleNode(fieldXPath);
        fieldEntries.push(fieldNode.tagName + ":" + fieldNode.text);
    }
    // One output line per record, e.g. item_name:blah1;item_qty:1;item_price:100
    WScript.Echo(fieldEntries.join(";"));
}
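Since the question mentions NodeList and SAX, here is a rough Java equivalent of the same two-step idea, using only the JDK's javax.xml.xpath. Treat it as a sketch: the class name GroupedXPath, the extract and parse helpers, and the inline sample XML are made up for illustration, and it assumes the input is already well-formed XML without namespaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class GroupedXPath {

    // Step 1: select the record nodes. Step 2: evaluate every field XPath
    // relative to each record, so values from the same group stay together.
    static List<Map<String, String>> extract(Document doc, String recordXPath,
            Map<String, String> fieldXPaths) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList records = (NodeList) xpath.evaluate(recordXPath, doc, XPathConstants.NODESET);
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            Node record = records.item(i);
            Map<String, String> row = new LinkedHashMap<>(); // keeps field order
            for (Map.Entry<String, String> field : fieldXPaths.entrySet()) {
                // String evaluation yields "" if the field is missing in this record.
                row.put(field.getKey(), xpath.evaluate(field.getValue(), record));
            }
            rows.add(row);
        }
        return rows;
    }

    // Parses well-formed XML from a string (assumption: no namespaces involved).
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body>"
                + "<div class=\"divclass\"><item><item_name>blah1</item_name>"
                + "<item_qty>1</item_qty><item_price>100</item_price></item></div>"
                + "<div class=\"divclass\"><item><item_name>blah2</item_name>"
                + "<item_qty>2</item_qty><item_price>200</item_price></item></div>"
                + "</body></html>";

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("item_name", "item_name");
        fields.put("item_qty", "item_qty");
        fields.put("item_price", "item_price");

        for (Map<String, String> row : extract(parse(xml), "//div[@class='divclass']/item", fields)) {
            // Prints one line per record in the name:value;name:value format.
            System.out.println(row.entrySet().stream()
                    .map(e -> e.getKey() + ":" + e.getValue())
                    .collect(Collectors.joining(";")));
        }
    }
}
```

The result is the list of maps the question asks for. The record XPath and the field XPaths are plain data, so nothing in extract is specific to this particular document.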
Upvotes: 1
Reputation: 128
I don't know if this helps, but I use XSLT to go the other way, from data to HTML. It seems to me that you just need to structure the XPath execution a little, and XSLT is good for this.
Upvotes: 0
Reputation: 15853
I am not sure I got your question, but it sounds like you want to use XPath on HTML documents.
To use XPath, the HTML document being parsed needs to be well-formed. There are several HTML parsers for Java; this article compares four of them.
HtmlCleaner seems to provide what you are after. It allows a subset of XPaths to be performed on "cleaned-up" HTML documents. Apparently it doesn't support the full set of XPath expressions though, see the documentation.
If you require more complex XPath expressions than what HtmlCleaner supports, you may need to use the javax.xml.xpath package with a well-formed XHTML document. JTidy can convert an HTML document to an XHTML one.
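As a rough illustration of the javax.xml.xpath route, the sketch below evaluates a prefixed XPath (like the html:div expressions in the question) against a namespace-aware XHTML document. The class name XhtmlXPath, the select helper and the sample markup are hypothetical, and the input is assumed to be well-formed XHTML already (for example, after a clean-up step such as JTidy, which is not shown here).

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class XhtmlXPath {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Parses well-formed XHTML and evaluates an XPath in which the
    // "html" prefix is bound to the XHTML namespace.
    static NodeList select(String xhtml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for prefixed XPath steps to match
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "html" : null;
            }
            @Override
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("html").iterator()
                        : Collections.<String>emptyIterator();
            }
        });
        return (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        // Sample input (assumption): elements inherit the default XHTML namespace.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"divclass\"><item><item_name>blah1</item_name></item></div>"
                + "<div class=\"divclass\"><item><item_name>blah2</item_name></item></div>"
                + "</body></html>";
        NodeList names = select(xhtml,
                "//html:div[@class='divclass']/html:item/html:item_name");
        for (int i = 0; i < names.getLength(); i++) {
            System.out.println(names.item(i).getTextContent());
        }
    }
}
```

Note that with a default xmlns on the root element, child elements such as item inherit the XHTML namespace too, so every step in the expression needs the prefix. If the parser is left non-namespace-aware instead, unprefixed names like //div can be used.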
I hope this answers your question.
Upvotes: 3
Reputation: 167446
I don't understand what you want to achieve and how it relates to XPath. If you want to map XML to Java objects then JAXB might help, but it is based on XML schemas, not on XPath.
Upvotes: 0