Milan

Reputation: 11

R: get node and child node from XML-file. error: subscript out of bounds

I'm using R to transform a database with XML-files into a CSV-file. I don't need everything from the XML-files, just some of the nodes.

My XML-data looks like this (simplified example):

<root>
<meta>
    <dc:identifier>unique identifier 001</dc:identifier>
    <dc:format>text/xml</dc:format>
    <dc:type>Verbatim Proceedings</dc:type>
    <dc:date>1963-06-05</dc:date>
</meta>
<proceedings>
<speech pm:speaker="Oudman" pm:function="voorzitter" pm:role="chair" pm:party-ref="nl.p.pvda" pm:member-ref="nl.m.00661" pm:id="nl.proc.sgd.d.19630000002.1.2">
            <p pm:id="nl.proc.sgd.d.19630000002.1.2.1">Ik deel aan de Kamer mede, dat zijn ingekomen berichten van verhindering tot bijwoning der vergadering van:</p>
            <p pm:id="nl.proc.sgd.d.19630000002.1.2.2">de heer Niers, wegens dringende bezigheden elders; de heren Van Hall, De Wilde, Van Riel, Cammelbeeck, Van der Waerden en Derksen, wegens verblijf buitenslands.</p>
            <p pm:id="nl.proc.sgd.d.19630000002.1.2.3">Deze berichten worden voor kennisgeving aangenomen.</p>
        </speech>
        <speech pm:speaker="Oudman" pm:function="voorzitter" pm:role="chair" pm:party-ref="nl.p.pvda" pm:member-ref="nl.m.00661" pm:id="nl.proc.sgd.d.19630000002.1.3">
            <p pm:id="nl.proc.sgd.d.19630000002.1.3.1">Ik ben er dankbaar voor, dat de heer Tjalma aanwezig is, zodat wij hem kunnen gelukwensen met zijn 70ste verjaardag. Ook mejuffrouw Tjeenk Willink wensen wij geluk met haar verjaardag.</p>
            <p pm:id="nl.proc.sgd.d.19630000002.1.3.2">Voorts deel ik aan de Kamer mede, dat is ingekomen een afschrift van het Koninklijk besluit van 28 mei 1963, nr. 33, houdende benoeming van Mr. J. A. Jonkman tot voorzitter van de Eerste Kamer der Staten-Generaal voor de zitting, welke zal aanvangen op 5 juni 1963.</p>
</speech>
</proceedings>
</root>

When I try to get information from some of the nodes with the XML-package in R, there seems to be no problem at all, using this for-loop:

for (filename in files) {
  doc <- xmlTreeParse(filename, useInternalNodes = TRUE)
  top <- xmlRoot(doc)
  date <- xmlValue(getNodeSet(doc, "//dc:date")[[1]])
}

When I try to get the information from the node speech and child nodes p in the same for-loop, R gives an error:

xmlValue(getNodeSet(doc,"//speech")[[1]]) -> speech
xmlValue(xmlChildren(doc[[speech]]), "p") -> speech

Error in getNodeSet(doc, "//speech")[[1]] : subscript out of bounds

I tried a lot of different methods, like this one:

xpathSApply(doc,"//speech",xmlValue) -> speech

But that generates an empty value.

I tried to understand why R generates this error. I thought it was because there are several <p> child nodes. But when I try this, I just get all the content of the XML-file (using /root works fine):

xmlValue(getNodeSet(doc, "/root")[[1]])-> root

Why can't I get the content of speech and p like the other nodes?

I hope someone can help me with this, and I hope I gave enough information about my problem with R.

Upvotes: 1

Views: 1434

Answers (2)

Milan

Reputation: 11

I think I've found the solution, thanks to the help from jdharrison.

xmlValue(getNodeSet(doc, "//*[local-name()='proceedings']")[[1]]) -> speech

This returns the content as text, which is just what I needed. Thanks!
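
One thing to keep in mind: selecting proceedings returns the text of all speeches glued together as a single string. Below is a minimal sketch of how the same local-name() pattern could return one string per speech instead. The shortened document and the namespace URI http://example.org/ns are made up for illustration and are not taken from the real files.

```r
library(XML)

# Hypothetical, shortened document; the namespace URI is an assumption.
xml_text <- paste0(
  '<root xmlns="http://example.org/ns"><proceedings>',
  '<speech><p>a</p><p>b</p></speech>',
  '<speech><p>c</p></speech>',
  '</proceedings></root>'
)
doc <- xmlParse(xml_text, asText = TRUE)

# One string holding the text of every speech under proceedings.
whole <- xmlValue(getNodeSet(doc, "//*[local-name()='proceedings']")[[1]])

# One string per speech, which maps more directly onto rows of a CSV file.
per_speech <- xpathSApply(doc, "//*[local-name()='speech']", xmlValue)
speech_table <- data.frame(speech = per_speech, stringsAsFactors = FALSE)
```

From there, write.csv(speech_table, ...) would give one row per speech rather than one cell containing the whole sitting.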

-M.

Upvotes: 0

jdharrison

Reputation: 30445

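The speech elements are most likely in a default namespace, so the unprefixed path //speech matches nothing. Define the appropriate namespace (the URI below is a placeholder for the one declared in your file):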

getNodeSet(doc, "//x:speech", namespaces = c(x = "http://www.mynamespace.com/jjjjaa/etc"))

or use the local name:

doc["//*[local-name()='speech']"]
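
To make the difference concrete, here is a small self-contained sketch of both options. It assumes a document whose root declares a default namespace; the URI http://example.org/ns and the shortened content are hypothetical, so substitute whatever xmlns value your files actually declare.

```r
library(XML)

# Hypothetical document: the default namespace URI is made up for illustration.
xml_text <- '<root xmlns="http://example.org/ns">
  <proceedings>
    <speech><p>first</p><p>second</p></speech>
    <speech><p>third</p></speech>
  </proceedings>
</root>'
doc <- xmlParse(xml_text, asText = TRUE)

# An unprefixed name in XPath only matches elements in no namespace,
# so this node set is empty and [[1]] on it is "subscript out of bounds".
plain <- getNodeSet(doc, "//speech")

# Option 1: bind a prefix (here "x") to the document's namespace URI.
ns <- c(x = "http://example.org/ns")
speeches <- getNodeSet(doc, "//x:speech", namespaces = ns)

# Option 2: ignore namespaces and match on the local element name.
paragraphs <- xpathSApply(doc, "//*[local-name()='p']", xmlValue)
```

The prefix approach is stricter, since it only matches elements in that one namespace, while local-name() matches any element with that name regardless of namespace.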

Upvotes: 1
