Reputation: 59
I have an XML file that looks roughly like this:
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19 http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19/pagecontent.xsd" pcGtsId="pc-00530982">
<Metadata/>
<Page imageFilename="00530982.tif" imageWidth="3346" imageHeight="5328">
<TextRegion id="r2" readingDirection="left-to-right" type="paragraph">
<Coords>
<Point x="94" y="3372"/>
<Point x="356" y="3375"/>
<Point x="326" y="3375"/>
<Point x="317" y="3369"/>
<Point x="160" y="3368"/>
<Point x="152" y="3368"/></Coords>
<TextEquiv>
<Unicode>Obl. Atl. Gr. W. Spw. 7 pCt. 52½, ⅞, ¾; Debentures Dito 8 pCt.
59½, 60¾, 59½; Obl. St. Paul en Pacific Spw. 7 pCt. 56¼ Nieuwe
Russen 1866 154¾, 155.</Unicode></TextEquiv></TextRegion>
</Page>
</PcGts>
I need to extract the x and y coordinates for a set of preselected TextRegion IDs.
For a start, I tried
x <- as.numeric(unlist(sapply(xmlChildren(gt[["Page"]][["TextRegion"]][["Coords"]]), xmlGetAttr, "x")))
y <- as.numeric(unlist(sapply(xmlChildren(gt[["Page"]][["TextRegion"]][["Coords"]]), xmlGetAttr, "y")))
That works, but it only returns the coordinates of the first TextRegion. I need to be able to get the values for any given ID. How do I do that?
I tried
coords <- as.data.frame(unlist(xpathSApply(gt, "//TextRegion[@id ='r2']/Coords/Point", xmlGetAttr, "x")))
but I only get an empty data frame.
What am I missing here?
Upvotes: 0
Views: 63
Reputation: 7232
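The root element declares a default namespace, so unprefixed names such as TextRegion match nothing in XPath. Bind a prefix to that namespace and use it in every step of the path: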
ns <- "http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19"
xpathSApply(
  gt, "//x:TextRegion[@id ='r2']/x:Coords/x:Point",
  namespaces = c(x = ns),
  xmlGetAttr, "x")
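To get both x and y for any given ID, the same namespaced path can be wrapped in a small function. The following is a minimal sketch, assuming the file was parsed with xmlParse() from the XML package; get_coords() is a hypothetical helper name, and the inline document is a trimmed-down stand-in for the real file.

```r
library(XML)

# Trimmed-down stand-in for the real file (same default namespace).
xml_text <- '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="00530982.tif">
    <TextRegion id="r2">
      <Coords>
        <Point x="94" y="3372"/>
        <Point x="356" y="3375"/>
      </Coords>
    </TextRegion>
  </Page>
</PcGts>'
gt <- xmlParse(xml_text, asText = TRUE)

# Bind the prefix "x" to the document's default namespace.
ns <- c(x = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19")

# Hypothetical helper (not part of the XML package): returns one row per
# Point of the TextRegion with the given id.
get_coords <- function(doc, id) {
  path <- sprintf("//x:TextRegion[@id='%s']/x:Coords/x:Point", id)
  pts <- getNodeSet(doc, path, namespaces = ns)
  data.frame(
    x = as.numeric(vapply(pts, xmlGetAttr, character(1), "x")),
    y = as.numeric(vapply(pts, xmlGetAttr, character(1), "y"))
  )
}

coords <- get_coords(gt, "r2")

# For several preselected IDs, collect one data frame per ID in a named list.
ids <- c("r2")
all_coords <- lapply(setNames(ids, ids), get_coords, doc = gt)
```

An ID that does not occur in the document yields a data frame with zero rows rather than an error, which makes it easy to spot typos in the list of IDs.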
Upvotes: 1