R convert XML (complex structure) to dataframe

Question

I'm trying to convert a XML with a complex structure of nodes into a data frame using R. This is a brief example of the XML file:

When I try to convert this file to a data frame, I use the following code (XML library)

library(XML)
doc=xmlParse("file.xml")
xmldf=xmlToDataFrame(nodes = getNodeSet(doc, "//product"))

And after that, the final result is this data frame that you can see bellow:

  id        data                 attributes
1  1 12345London Name_1Value_1Name_2Value_2
2  2 98765London Name_9Value_9Name_8Value_8

How can I get a different data frame, eliminating the complex structure of the XML file to get a result like this?

  id number   city name.1 value.1 name.2 value.2
1  1  12345 London Name_1 Value_1 Name_2 Vlaue_2
2  2  98765 London Name_9 Value_9 Name_8 Value_8

camille · Accepted Answer

I'm less familiar with the XML package but have used the xml2 package more. It fits in with the tidyverse packages, so it works well with a purrr-based approach I'll use here. For each node, I'm calling a function that extracts all its child id, number, city, name, and value nodes and pulls out their text. I did it by product because I wanted to get a small data frame for each in order to make sure all the IDs stay together with the name & value nodes, allowing those to have different lengths. Finally, map_dfr binds the list of data frames row-wise.

library(tidyr)
library(purrr)
library(xml2)


products <- read_xml("text.xml") %>%
  xml_find_all("//product")

prod_df <- map_dfr(products, function(p_node) {
  list(".//id", ".//number", ".//city", ".//name", ".//value") %>%
    set_names(stringr::str_extract, "\w+") %>%
    map(~xml_find_all(p_node, .)) %>%
    map(xml_text) %>%
    as_tibble()
})

prod_df
#> # A tibble: 4 x 5
#>   id    number city   name   value  
#>            
#> 1 1     12345  London Name_1 Value_1
#> 2 1     12345  London Name_2 Value_2
#> 3 2     98765  London Name_9 Value_9
#> 4 2     98765  London Name_8 Value_8

Personally, I'd recommend working in this format, especially since you might have differing numbers of name-value pairs for different products. But if you really need a wide format, you can mark off an observation number for each product's children, then reshape.

prod_df %>%
  dplyr::group_by(id, number, city) %>%
  dplyr::mutate(obs = dplyr::row_number()) %>%
  pivot_wider(names_from = obs, values_from = c(name, value), names_sep = ".")
#> # A tibble: 2 x 7
#> # Groups:   id, number, city [2]
#>   id    number city   name.1 name.2 value.1 value.2
#>                 
#> 1 1     12345  London Name_1 Name_2 Value_1 Value_2
#> 2 2     98765  London Name_9 Name_8 Value_9 Value_8

R convert XML (complex structure) to dataframe

Answers (2)

Related Questions