GCGM
GCGM

Reputation: 1073

Retrieve value of XML tag in R

I am working with the following XML code and my intention is to extract the value of the <file name> tag. So far, I have managed to isolate the <file> tag but not sure how to continue.

<?xml version="1.0" encoding="UTF-8"?>
<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="S2B_MSIL2A_20190812T105629_N0213_R094_T30TVK_20190812T135547.zip">
    <hash type="MD5">C6D1AE69805DBB5C26F4467D5FD2EAF3</hash>
    <size>1224364821</size>
    <url>https://scihub.copernicus.eu/dhus/odata/v1/Products('ccc04758-2f4e-4130-9f1a-dce5f21a2ef0')/$value</url>
  </file>
  <file name="S2B_MSIL2A_20190812T105629_N0213_R094_T30TWL_20190812T135547.zip">
    <hash type="MD5">E5AD6A272E95ECBFEE8FCDB5A5214343</hash>
    <size>1209895563</size>
    <url>https://scihub.copernicus.eu/dhus/odata/v1/Products('534d7e19-03fa-4aee-93cf-dca980fa2e5a')/$value</url>
  </file>
</metalink>

I am using the following R code:

require(XML)

file<-xmlParse('my/xml/file') 
test<-xmlRoot(file)
test2<-xmlElementsByTagName(test, "file", recursive = TRUE)
test2

Where the output is as follows:

> test2
$file
<file name="S2B_MSIL2A_20190812T105629_N0213_R094_T30TVK_20190812T135547.zip">
  <hash type="MD5">C6D1AE69805DBB5C26F4467D5FD2EAF3</hash>
  <size>1224364821</size>
  <url>https://scihub.copernicus.eu/dhus/odata/v1/Products('ccc04758-2f4e-4130-9f1a-dce5f21a2ef0')/$value</url>
</file> 

$file
<file name="S2B_MSIL2A_20190812T105629_N0213_R094_T30TWL_20190812T135547.zip">
  <hash type="MD5">E5AD6A272E95ECBFEE8FCDB5A5214343</hash>
  <size>1209895563</size>
  <url>https://scihub.copernicus.eu/dhus/odata/v1/Products('534d7e19-03fa-4aee-93cf-dca980fa2e5a')/$value</url>
</file>

The objective would be to store the S2B_MSIL2A........zip values in a variable or R list. I have tried to use file name in the xmlElementsByTagName function but it outputs a empty list.

Upvotes: 1

Views: 1696

Answers (3)

Dave2e
Dave2e

Reputation: 24079

Here using the xml2 package (I find it easier to use than XML). This is just a matter of searching for the "file" nodes and retrieving the text associated with the 'name' attributes.

This file does contain a namespace which does slightly complicates the matter. There are 2 options, add the namespace into the xpath, or strip the namespace out and use a simpler xpath phrase.

library(dplyr)
library(xml2)
#read the file
page<-read_xml("test.xml")

#FYI
print(xml_ns(page))

First Option:

#find the "file" node within the namespace
option1<-page %>% xml_find_all("//d1:file", ns=xml_ns(page)) %>%xml_attr(attr="name")

Second Option:

#Strip out the name space then find the node with a simiplier xpath
#If it is a long complex file, `xml_ns_strip` could take awhile
page %>% xml_ns_strip()
option2<-page %>% xml_find_all("//file") %>%xml_attr(attr="name")

Upvotes: 1

Parfait
Parfait

Reputation: 107587

Because the file node is encapsulated in a default namespace, any reference to local name, file, returns nothing. To resolve, you need to map any defined prefix such as doc and use it as a colon separated prefix when referencing node: doc:file.

Consider the XML apply methods that can handle such namespace mapping:

ns <- c(doc="urn:ietf:params:xml:ns:metalink")
xpathApply(file, path='//doc:file', namespaces = ns, xmlAttrs)           # RETURNS A LIST
# [[1]]
#                                                               name 
# "S2B_MSIL2A_20190812T105629_N0213_R094_T30TVK_20190812T135547.zip" 
# 
# [[2]]
#                                                               name 
# "S2B_MSIL2A_20190812T105629_N0213_R094_T30TWL_20190812T135547.zip

xpathSApply(file, path='//doc:file', namespaces = ns, xmlAttrs)         # RETURNS A VECTOR
#                                                               name 
# "S2B_MSIL2A_20190812T105629_N0213_R094_T30TVK_20190812T135547.zip" 
#                                                               name 
# "S2B_MSIL2A_20190812T105629_N0213_R094_T30TWL_20190812T135547.zip"

Alternatively, consider the non-documented method xmlAttrsToDataFrame available as a triple-colon method to build a data frame:

ns <- c(doc="urn:ietf:params:xml:ns:metalink")
file_df <- XML:::xmlAttrsToDataFrame(getNodeSet(file, path='//doc:file', namespaces = ns))

file_df
#                                                               name
# 1 S2B_MSIL2A_20190812T105629_N0213_R094_T30TVK_20190812T135547.zip
# 2 S2B_MSIL2A_20190812T105629_N0213_R094_T30TWL_20190812T135547.zip

Upvotes: 1

maydin
maydin

Reputation: 3755

By using xml2,

library(xml2)
library(magrittr)
my_data <- read_xml( "./desktop/data.xml" )
output <- xml_children(my_data) %>% xml_attr("name")

gives,

> output
[1] "S2B_MSIL2A_20190812T105629_N0213_R094_T30TVK_20190812T135547.zip"
[2] "S2B_MSIL2A_20190812T105629_N0213_R094_T30TWL_20190812T135547.zip"

Upvotes: 1

Related Questions