Reputation: 77

Remove or filter XML nodes by Xpaths from file in R

I have very large, complex XML files (like this one: https://github.com/HL7/C-CDA-Examples/blob/master/General/Parent%20Document%20Replace%20Relationship/CCD%20Parent%20Document%20Replace%20(C-CDAR2.1).xml ) to process, but I only need the attributes and values at particular XPaths (nodes). Removing the unneeded nodes before the detailed processing should cut processing time.

So far I have tried xml_remove from the xml2 package:

library(xml2)

xmlfile <- paste0(dir, "xmlFiles/", filelist[k])
file <- read_xml(xmlfile)
file <- xml_ns_strip(file)

# Remove every node matched by each XPath listed in xpathTable
for (counx in seq_len(nrow(xpathTable))) {
  xr <- xml_find_all(file, xpath = paste0("/", toString(xpathTable$xpaths[counx])))
  xml_remove(xr, free = TRUE)
}

This works well for removing a few nodes but crashes as the number of nodes goes up (>100).

Below is an example of what I want to achieve. Given this input:

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
    </book>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
        <ISBN>
            <Random>12354</Random>
        </ISBN>
    </book>
    <book category="web">
        <title lang="en">XQuery Kick Start</title>
        <author>James McGovern</author>
        <author>Per Bothner</author>
        <author>Kurt Cagle</author>
        <author>James Linn</author>
        <author>Vaidyanathan Nagarajan</author>
        <year>2003</year>
        <price>49.99</price>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <ISBN>
            <Random>12345</Random>
        </ISBN>
        <price>39.95</price>
    </book>
</bookstore>

Keeping only the nodes at these XPaths:

/bookstore/book/title
/bookstore/book/year
/bookstore/book/ISBN/Random

should give:

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book category="cooking">
        <title lang="en">Everyday Italian</title>       
        <year>2005</year>
    </book>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <year>2005</year>
        <ISBN>
            <Random>12354</Random>
        </ISBN>
    </book>
    <book category="web">
        <title lang="en">XQuery Kick Start</title>
        <year>2003</year>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <year>2003</year>
        <ISBN>
            <Random>12345</Random>
        </ISBN>
    </book>
</bookstore>

Upvotes: 0

Answers (2)

LMC

Reputation: 12682

All the wanted elements can be selected with a single XPath 1.0 expression, which is valid in many languages:

/bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]

Equivalent/similar expressions:

/bookstore/book/title | /bookstore/book/year | /bookstore/book/ISBN/Random
//book/@category | //book/year | //ISBN/Random
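Since the question only needs the values at known XPaths, the union form can be passed straight to xml2 without removing anything first. A minimal sketch, assuming the xml2 package the question already uses; the inline document and the names keep, doc, nodes and result are illustrative only:

```r
library(xml2)

# Small stand-in for the bookstore example from the question
doc <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <year>2005</year>
    <ISBN><Random>12354</Random></ISBN>
  </book>
</bookstore>')

# XPaths to keep, e.g. taken from a table column
keep <- c("/bookstore/book/title",
          "/bookstore/book/year",
          "/bookstore/book/ISBN/Random")

# Join them into one union expression and query the document once
nodes <- xml_find_all(doc, paste(keep, collapse = " | "))

# One row per matched node: where it sits and what it contains
result <- data.frame(
  path  = xml_path(nodes),
  name  = xml_name(nodes),
  value = xml_text(nodes),
  stringsAsFactors = FALSE
)
print(result)
```

This queries the tree once instead of once per XPath and leaves the document unmodified, so it does not go through xml_remove at all.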

To select the elements to remove instead, i.e. every child of book that is not on the keep list:

//book/*[not(name()="title" or name()="year" or name()="ISBN" or name()="Random")]

For XML documents with namespaces, local-name() can be used instead of name() if you are not using namespace handling.
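In R, the filter-out expression above could be applied in one pass rather than in a loop over many XPaths. This is only a sketch, assuming xml2 as in the question; the inline document and the names doc, drop and remaining are illustrative. It leaves free at its default of FALSE, because the xml2 documentation warns that free = TRUE can crash R if any handle to a removed node is used afterwards. Whether that is the cause of the crash in the question is an assumption, not something confirmed here.

```r
library(xml2)

# Small stand-in for the bookstore example from the question
doc <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
    <ISBN><Random>12354</Random></ISBN>
  </book>
</bookstore>')

# One query selects every child of <book> that is NOT on the keep list
drop <- xml_find_all(
  doc,
  '//book/*[not(name()="title" or name()="year" or name()="ISBN" or name()="Random")]'
)

# Remove the whole node set in a single call (free stays FALSE)
xml_remove(drop)

# Names of the children that are left under each <book>
remaining <- xml_name(xml_find_all(doc, "//book/*"))
print(remaining)
```

After this, doc can be written out with write_xml or processed further in the same session.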

Testing the given example on the command line with xmllint:

echo 'cat /bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]' | xmllint --shell test.xml

Result:

/ > cat /bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]
 -------
<title lang="en">Everyday Italian</title>
 -------
<year>2005</year>
 -------
<title lang="en">Harry Potter</title>
 -------
<year>2005</year>
 -------
<Random>12354</Random>
 -------
<title lang="en">XQuery Kick Start</title>
 -------
<year>2003</year>
 -------
<title lang="en">Learning XML</title>
 -------
<year>2003</year>
 -------
<Random>12345</Random>
/ >

For the R crash you mention, it is worth looking here.

Upvotes: 0

wp78de

Reputation: 18950

This looks like a job for XQuery, e.g. you could recreate your document like this:

<bookstore>{
  for $book in /bookstore/*
  return <book category="{$book/@category}">
    {$book/title}
    {$book/year}
    {$book/ISBN}
  </book>
}</bookstore>

Applied to the book example, this produces the result shown below it in the question. You can test it online, selecting XQuery as the option, at https://www.videlibri.de/cgi-bin/xidelcgi

There might be ways to run XQuery from R but I would rather do it in a pre-processing step from the command line using a tool like xidel.
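If staying inside R is preferred, the same "rebuild the document" idea can be approximated with xml2 instead of XQuery. This is a rough sketch of that swapped-in technique, not a way of running the XQuery itself; the inline document and the names doc, out, books, new_book and wanted are illustrative:

```r
library(xml2)

# Small stand-in for the bookstore example from the question
doc <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <year>2005</year>
    <price>29.99</price>
    <ISBN><Random>12354</Random></ISBN>
  </book>
</bookstore>')

# Start an empty output document with the same root element name
out <- xml_new_root("bookstore")

books <- xml_find_all(doc, "/bookstore/book")
for (i in seq_along(books)) {
  book <- books[[i]]
  # Recreate <book>, carrying over its category attribute
  new_book <- xml_add_child(out, "book", category = xml_attr(book, "category"))
  # Copy only the wanted children; xml_add_child copies existing nodes by default
  wanted <- xml_find_all(book, "title | year | ISBN")
  for (j in seq_along(wanted)) {
    xml_add_child(new_book, wanted[[j]])
  }
}

cat(as.character(out))
```

Building a new, smaller document leaves the original tree untouched, so nothing has to be removed or freed.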

Upvotes: 1
