Reputation: 4232
I can't manage to extract information from AWIS results (containing Alexa data).
I have a bunch of XML files containing AWIS data, from which I want to extract values such as Rank and PageViews for the 3-month period.
The two (colliding) namespaces are confusing, and my XPath expressions are not working as intended. (Even a simple //aws:Rank/text() is not working.)
It would be great if somebody could help me get going.
Currently I am using the JDOM library, but wouldn't mind using something else. This is a tiny example that does not work as expected:
Document doc = new SAXBuilder().build(file);
XPath xpath = XPath.newInstance("//aws:Rank");
xpath.addNamespace("aws", "http://awis.amazonaws.com/doc/2005-07-11/");
Element rank = (Element) xpath.selectSingleNode(doc);
I'd prefer to use javax.xml, though...
Here's an example of the XML:
<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
<aws:OperationRequest>
<aws:RequestId>XXXX-XXXX-XXXX-XXXX-XXXX</aws:RequestId>
</aws:OperationRequest>
<aws:UrlInfoResult>
<aws:Alexa>
<aws:ContactInfo>
<aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
<aws:PhoneNumbers>
<aws:PhoneNumber>+33 140289796</aws:PhoneNumber>
</aws:PhoneNumbers>
<aws:OwnerName>John Fay</aws:OwnerName>
<aws:Email>hostmaster@superbregistrar.net</aws:Email>
<aws:PhysicalAddress>
<aws:Streets>
<aws:Street>22 rue Saint Sauveur</aws:Street>
</aws:Streets>
<aws:City>Paris 75002,</aws:City>
<aws:Country>FRANCE</aws:Country>
</aws:PhysicalAddress>
<aws:CompanyStockTicker/>
</aws:ContactInfo>
<aws:ContentData>
<aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
<aws:SiteData>
<aws:Title>Ah Paris</aws:Title>
<aws:Description>Short term apartment rentals. Search engine, descriptions, photos, rates.</aws:Description>
<aws:OnlineSince>26-Feb-2003</aws:OnlineSince>
</aws:SiteData>
<aws:Keywords>
<aws:Keyword>Français</aws:Keyword>
<aws:Keyword>Ile-de-France</aws:Keyword>
</aws:Keywords>
<aws:OwnedDomains>
<aws:OwnedDomain>
<aws:Domain>paris-tournament.org</aws:Domain>
<aws:Title>paris-tournament.org</aws:Title>
</aws:OwnedDomain>
</aws:OwnedDomains>
</aws:ContentData>
<aws:TrafficData>
<aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
<aws:Rank>2547606</aws:Rank>
<aws:RankByCountry/>
<aws:RankByCity/>
<aws:UsageStatistics>
<aws:UsageStatistic>
<aws:TimeRange>
<aws:Months>3</aws:Months>
</aws:TimeRange>
<aws:Rank>
<aws:Value>2547606</aws:Value>
<aws:Delta>-658661</aws:Delta>
</aws:Rank>
<aws:Reach>
<aws:Rank>
<aws:Value>2964984</aws:Value>
<aws:Delta>-152875</aws:Delta>
</aws:Rank>
<aws:PerMillion>
<aws:Value>0.28</aws:Value>
<aws:Delta>-10.64%</aws:Delta>
</aws:PerMillion>
</aws:Reach>
<aws:PageViews>
<aws:PerMillion>
<aws:Value>0.01</aws:Value>
<aws:Delta>+100%</aws:Delta>
</aws:PerMillion>
<aws:Rank>
<aws:Value>2143379</aws:Value>
<aws:Delta>-1628449</aws:Delta>
</aws:Rank>
<aws:PerUser>
<aws:Value>4.0</aws:Value>
<aws:Delta>+120%</aws:Delta>
</aws:PerUser>
</aws:PageViews>
</aws:UsageStatistic>
<aws:UsageStatistic>
<aws:TimeRange>
<aws:Months>1</aws:Months>
</aws:TimeRange>
<aws:Rank>
<aws:Value>1430628</aws:Value>
<aws:Delta>-3224794</aws:Delta>
</aws:Rank>
<aws:Reach>
<aws:Rank>
<aws:Value>1656655</aws:Value>
<aws:Delta>-5103474</aws:Delta>
</aws:Rank>
<aws:PerMillion>
<aws:Value>0.5</aws:Value>
<aws:Delta>+500%</aws:Delta>
</aws:PerMillion>
</aws:Reach>
<aws:PageViews>
<aws:PerMillion>
<aws:Value>0.02</aws:Value>
<aws:Delta>+100%</aws:Delta>
</aws:PerMillion>
<aws:Rank>
<aws:Value>1279227</aws:Value>
<aws:Delta>-859817</aws:Delta>
</aws:Rank>
<aws:PerUser>
<aws:Value>4</aws:Value>
<aws:Delta>-63.11%</aws:Delta>
</aws:PerUser>
</aws:PageViews>
</aws:UsageStatistic>
<aws:UsageStatistic>
<aws:TimeRange>
<aws:Days>7</aws:Days>
</aws:TimeRange>
<aws:Rank>
<aws:Value>1927968</aws:Value>
<aws:Delta>+757770</aws:Delta>
</aws:Rank>
<aws:Reach>
<aws:Rank>
<aws:Value>2942088</aws:Value>
<aws:Delta>+1612570</aws:Delta>
</aws:Rank>
<aws:PerMillion>
<aws:Value>0.3</aws:Value>
<aws:Delta>-64.64%</aws:Delta>
</aws:PerMillion>
</aws:Reach>
<aws:PageViews>
<aws:PerMillion>
<aws:Value>0.05</aws:Value>
<aws:Delta>+80%</aws:Delta>
</aws:PerMillion>
<aws:Rank>
<aws:Value>708394</aws:Value>
<aws:Delta>-413955</aws:Delta>
</aws:Rank>
<aws:PerUser>
<aws:Value>10</aws:Value>
<aws:Delta>+400%</aws:Delta>
</aws:PerUser>
</aws:PageViews>
</aws:UsageStatistic>
</aws:UsageStatistics>
<aws:ContributingSubdomains>
<aws:ContributingSubdomain>
<aws:DataUrl>ahparis.com</aws:DataUrl>
<aws:TimeRange>
<aws:Months>1</aws:Months>
</aws:TimeRange>
<aws:Reach>
<aws:Percentage>100.00%</aws:Percentage>
</aws:Reach>
<aws:PageViews>
<aws:Percentage>100.00%</aws:Percentage>
<aws:PerUser>4</aws:PerUser>
</aws:PageViews>
</aws:ContributingSubdomain>
</aws:ContributingSubdomains>
</aws:TrafficData>
</aws:Alexa>
</aws:UrlInfoResult>
<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:StatusCode>Success</aws:StatusCode>
</aws:ResponseStatus>
</aws:Response>
</aws:UrlInfoResponse>
Upvotes: 0
Views: 344
Reputation: 122414
It looks like a typo in the namespace URI - your code has
xpath.addNamespace("aws", "http://awis.amazonaws.com/doc/2005-07-11/");
(with a trailing slash) but the document has
xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"
(without the slash).
"I'd prefer to use javax.xml though..."
Namespace handling is a real pain in javax.xml.xpath, because the Java class library provides no default implementation of the NamespaceContext interface. You have to either implement your own or use a third-party implementation (I usually go for the SimpleNamespaceContext from Spring). If you're going to be doing a lot of XPath manipulation, I'd suggest looking at Saxon 9 (the HE version is free of charge) and using its s9api, as this supports the far more powerful version 2.0 of the XPath language.
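As a rough illustration of the "implement your own" route, here is a minimal sketch using only the JDK. The class names AwisRankExample and SinglePrefixContext are made up for this example, and the XML is a trimmed-down copy of the sample from the question. It assumes the prefix is bound to the inner namespace URI exactly as the document declares it, without the trailing slash.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of any library.
class AwisRankExample {

    // Inner namespace URI exactly as declared in the document (no trailing slash).
    static final String AWIS_NS = "http://awis.amazonaws.com/doc/2005-07-11";

    // Trimmed-down version of the sample document from the question.
    static final String SAMPLE =
          "<aws:UrlInfoResponse xmlns:aws=\"http://alexa.amazonaws.com/doc/2005-10-05/\">"
        + "<aws:Response xmlns:aws=\"http://awis.amazonaws.com/doc/2005-07-11\">"
        + "<aws:UrlInfoResult><aws:Alexa><aws:TrafficData>"
        + "<aws:Rank>2547606</aws:Rank>"
        + "</aws:TrafficData></aws:Alexa></aws:UrlInfoResult>"
        + "</aws:Response></aws:UrlInfoResponse>";

    // Minimal NamespaceContext that knows exactly one prefix/URI pair.
    static final class SinglePrefixContext implements NamespaceContext {
        private final String prefix;
        private final String uri;

        SinglePrefixContext(String prefix, String uri) {
            this.prefix = prefix;
            this.uri = uri;
        }

        @Override
        public String getNamespaceURI(String p) {
            if (p == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return uri.equals(namespaceURI) ? prefix : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            return uri.equals(namespaceURI)
                    ? Collections.singletonList(prefix).iterator()
                    : Collections.<String>emptyIterator();
        }
    }

    // Returns the text of TrafficData/Rank, or "" if nothing matches.
    static String extractRank(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // without this, prefixed XPath steps match nothing
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new SinglePrefixContext("aws", namespaceUri));
            return xpath.evaluate("//aws:TrafficData/aws:Rank", doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("exact URI:      [" + extractRank(SAMPLE, AWIS_NS) + "]");
        System.out.println("trailing slash: [" + extractRank(SAMPLE, AWIS_NS + "/") + "]");
    }
}
```

Namespace URIs are compared as plain strings, so the second call, which binds the prefix to the URI with a trailing slash, matches no element and returns an empty string.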
Upvotes: 1
Reputation: 17707
You have a typo in your code. You have:
xpath.addNamespace("aws", "http://aws.amazonaws.com/doc/2005-07-11/");
but you should have:
xpath.addNamespace("aws", "http://awis.amazonaws.com/doc/2005-07-11/");
(note the change from aws to awis).
Additionally, you should really be using JDOM 2.x and the new XPath API that was introduced there. The JDOM 2.x versions have significantly better handling for namespaces, and generics on the resulting content. See The changes in JDOM2.x XPath handling.
Upvotes: 2
Reputation: 7173
I tried this on your input using XSLT with the following stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:alex="http://alexa.amazonaws.com/doc/2005-10-05/"
xmlns:awis="http://awis.amazonaws.com/doc/2005-07-11"
version="1.0">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:value-of select="//awis:Rank/text()"/>
</xsl:template>
</xsl:stylesheet>
and somehow I got an output of:
2547606
I suppose you have to register the two namespaces under different prefixes, then use those prefixes in your XPath.
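The same idea can be sketched in Java with javax.xml.xpath. This is only an illustration: the class names AwisTwoPrefixExample and MapContext are hypothetical, the prefixes alexa and awis are arbitrary choices (they need not match the aws prefix used in the document), and the XML is a trimmed-down copy of the sample from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of any library.
class AwisTwoPrefixExample {

    // Trimmed-down sample: the document reuses the prefix "aws" for two different URIs.
    static final String SAMPLE =
          "<aws:UrlInfoResponse xmlns:aws=\"http://alexa.amazonaws.com/doc/2005-10-05/\">"
        + "<aws:Response xmlns:aws=\"http://awis.amazonaws.com/doc/2005-07-11\">"
        + "<aws:UsageStatistics>"
        + "<aws:UsageStatistic><aws:TimeRange><aws:Months>3</aws:Months></aws:TimeRange>"
        + "<aws:Rank><aws:Value>2547606</aws:Value></aws:Rank>"
        + "<aws:PageViews><aws:PerUser><aws:Value>4.0</aws:Value></aws:PerUser></aws:PageViews>"
        + "</aws:UsageStatistic>"
        + "<aws:UsageStatistic><aws:TimeRange><aws:Months>1</aws:Months></aws:TimeRange>"
        + "<aws:Rank><aws:Value>1430628</aws:Value></aws:Rank>"
        + "</aws:UsageStatistic>"
        + "</aws:UsageStatistics>"
        + "<aws:ResponseStatus xmlns:aws=\"http://alexa.amazonaws.com/doc/2005-10-05/\">"
        + "<aws:StatusCode>Success</aws:StatusCode></aws:ResponseStatus>"
        + "</aws:Response></aws:UrlInfoResponse>";

    // Small map-backed NamespaceContext: one distinct prefix per namespace URI.
    static final class MapContext implements NamespaceContext {
        private final Map<String, String> uriByPrefix = new LinkedHashMap<>();

        MapContext bind(String prefix, String uri) {
            uriByPrefix.put(prefix, uri);
            return this;
        }

        @Override
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            String uri = uriByPrefix.get(prefix);
            return uri != null ? uri : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            Iterator<String> it = getPrefixes(namespaceURI);
            return it.hasNext() ? it.next() : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            List<String> prefixes = new ArrayList<>();
            for (Map.Entry<String, String> e : uriByPrefix.entrySet()) {
                if (e.getValue().equals(namespaceURI)) {
                    prefixes.add(e.getKey());
                }
            }
            return prefixes.iterator();
        }
    }

    // Evaluates an XPath expression as a string; returns "" if nothing matches.
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new MapContext()
                    .bind("alexa", "http://alexa.amazonaws.com/doc/2005-10-05/")
                    .bind("awis", "http://awis.amazonaws.com/doc/2005-07-11"));
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Select the UsageStatistic whose TimeRange is 3 months, then drill down.
        String threeMonths = "//awis:UsageStatistic[awis:TimeRange/awis:Months = 3]";
        System.out.println(evaluate(SAMPLE, threeMonths + "/awis:Rank/awis:Value"));
        System.out.println(evaluate(SAMPLE, threeMonths + "/awis:PageViews/awis:PerUser/awis:Value"));
        // ResponseStatus re-declares the prefix, so it lives in the outer namespace.
        System.out.println(evaluate(SAMPLE, "//alexa:ResponseStatus/alexa:StatusCode"));
    }
}
```

The predicate on awis:Months picks the 3-month block rather than relying on element order. Elements under aws:ResponseStatus have to be addressed with the alexa prefix, because that element re-binds aws to the outer URI.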
Upvotes: 1