Reputation: 77
I have very large, complex XML files (like this one: https://github.com/HL7/C-CDA-Examples/blob/master/General/Parent%20Document%20Replace%20Relationship/CCD%20Parent%20Document%20Replace%20(C-CDAR2.1).xml ) to process, but I only need the attributes and values at particular XPaths (nodes). Removing the unneeded nodes first should cut processing time by filtering out the fluff before the detailed processing.
So far I have tried xml_remove from the xml2 package:
library(xml2)
xmlfile <- paste0(dir, "xmlFiles/", filelist[k])
file <- read_xml(xmlfile)
file <- xml_ns_strip(file)
for (counx in seq_len(nrow(xpathTable))) {
  # each entry of xpathTable$xpaths is the XPath of nodes to drop
  xr <- xml_find_all(file, xpath = paste0("/", toString(xpathTable$xpaths[counx])))
  xml_remove(xr, free = TRUE)
}
This works well when removing a few nodes, but it crashes as the number goes up (>100).
Here is a small example of the input and of the result I want to get:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
<ISBN>
<Random>12354</Random>
</ISBN>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="web">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<ISBN>
<Random>12345</Random>
</ISBN>
<price>39.95</price>
</book>
</bookstore>
After filtering by XPaths:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<year>2005</year>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<year>2005</year>
<ISBN>
<Random>12354</Random>
</ISBN>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<year>2003</year>
</book>
<book category="web">
<title lang="en">Learning XML</title>
<year>2003</year>
<ISBN>
<Random>12345</Random>
</ISBN>
</book>
</bookstore>
Upvotes: 0
Views: 339
Reputation: 12682
All the wanted elements can be selected with a single XPath 1.0 expression, which is valid in many languages:
/bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]
Equivalent or similar expressions (the second one selects the category attributes rather than the titles):
/bookstore/book/title | /bookstore/book/year | /bookstore/book/ISBN/Random
//book/@category | //book/year | //ISBN/Random
To select the elements to filter out, i.e. the children of book that are not wanted:
//book/*[not(name()="title" or name()="year" or name()="ISBN" or name()="Random")]
For XML with namespaces, local-name() can be used instead of name() if namespace handling is not used.
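As a rough illustration of the filter-out expression above, outside R: a minimal Python sketch using only the standard library's xml.etree.ElementTree. ElementTree's XPath subset has no name() predicates, so the sketch compares tag names directly. The KEEP set and the helpers local_name and prune_children are names made up for this example, and stripping the "{namespace}" prefix from the tag only approximates what local-name() does.

```python
import xml.etree.ElementTree as ET

# Element names to keep under each <book> (assumed list, taken from the example).
KEEP = {"title", "year", "ISBN", "Random"}

def local_name(tag):
    """Drop a '{namespace}' prefix, similar in spirit to XPath local-name()."""
    return tag.rsplit("}", 1)[-1]

def prune_children(root, parent_name="book", keep=KEEP):
    """Remove every direct child of <parent_name> whose name is not in keep."""
    removed = 0
    # Materialise the element list so the tree can be edited while looping.
    for parent in list(root.iter()):
        if local_name(parent.tag) != parent_name:
            continue
        # Copy the child list too: removing while iterating skips elements.
        for child in list(parent):
            if local_name(child.tag) not in keep:
                parent.remove(child)
                removed += 1
    return removed

SAMPLE = """<bookstore>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
    <ISBN><Random>12354</Random></ISBN>
  </book>
</bookstore>"""

root = ET.fromstring(SAMPLE)
removed_count = prune_children(root)
remaining = [local_name(c.tag) for c in root.find("book")]
print(removed_count, remaining)
```

Only direct children of book are tested, as in //book/*[...], so Random survives because its parent ISBN is kept.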
Testing the given example and elements on the command line:
echo 'cat /bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]' | xmllint --shell test.xml
Result:
/ > cat /bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]
-------
<title lang="en">Everyday Italian</title>
-------
<year>2005</year>
-------
<title lang="en">Harry Potter</title>
-------
<year>2005</year>
-------
<Random>12354</Random>
-------
<title lang="en">XQuery Kick Start</title>
-------
<year>2003</year>
-------
<title lang="en">Learning XML</title>
-------
<year>2003</year>
-------
<Random>12345</Random>
/ >
For the R crash mentioned in the question, it is worth looking here.
Upvotes: 0
Reputation: 18950
This looks like a job for XQuery. For example, you could recreate your document like this:
<bookstore>{
for $book in /bookstore/*
return <book category="{$book/@category}">
{$book/title}
{$book/year}
{$book/ISBN}
</book>
}</bookstore>
Applied to the book example, this gives the result shown below it in the question. You can test it online at https://www.videlibri.de/cgi-bin/xidelcgi by choosing XQuery as the option.
There might be ways to run XQuery from R, but I would rather do it as a pre-processing step from the command line, using a tool like xidel.
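If no XQuery processor is at hand, the same rebuild idea can be approximated in a small pre-processing script. The following is only a sketch in Python's standard library, not a replacement for the XQuery above: the WANTED tuple and the rebuild function are hypothetical names, and like the query it copies title, year and ISBN in that fixed order and carries over only the category attribute.

```python
import copy
import xml.etree.ElementTree as ET

# Children copied for each book, mirroring the XQuery above (assumed list).
WANTED = ("title", "year", "ISBN")

def rebuild(root, wanted=WANTED):
    """Return a new tree holding only the wanted children of each book."""
    new_root = ET.Element(root.tag)
    for book in root:
        # Like category="{$book/@category}": empty string if the attribute is absent.
        new_book = ET.SubElement(new_root, book.tag,
                                 {"category": book.get("category", "")})
        for name in wanted:
            for child in book.findall(name):
                # Deep copy so the original document is left untouched.
                new_book.append(copy.deepcopy(child))
    return new_root

SAMPLE = """<bookstore>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <ISBN><Random>12345</Random></ISBN>
    <price>39.95</price>
  </book>
</bookstore>"""

source_root = ET.fromstring(SAMPLE)
new_root = rebuild(source_root)
print(ET.tostring(new_root, encoding="unicode"))
```

Building a fresh tree avoids removing nodes from the parsed document altogether, which is the same design choice the XQuery makes.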
Upvotes: 1