Reputation: 1023

XML with varying number of subchildren to data frame/table

I am trying to transform data from XML to tabular form. I am struggling with elements with subchildren. Here's an example:

library(xml2)
library(data.table)

doc =
"<doc>
    <rec>
        <name> John </name>
        <address>
            <street> 2nd Av </street>
            <number> 1036 </number>
        </address>
        <hobbies>
            <hobby> tennis </hobby>
            <hobby> gardening </hobby>    
        </hobbies>
    </rec>
    <rec>
        <name> Mary </name>
        <address>
            <street>55th St</street>
            <number> 132 </number>
        </address>
        <hobbies>
            <hobby> running </hobby>
        </hobbies>
    </rec>
</doc>
"

# read in
pg <- read_xml(doc)

# make a list of records
recs = xml_find_all(pg, "//rec", xml_ns(pg))

# function to loop over list
extractRecord = function(x) {
    
    txt = xml_text(xml_children(x))
    name = xml_name(xml_children(x))
    names(txt) = name
    
    dt = setDT(as.list(txt))[]
    return(dt)
}

# loop over list of records
lst = lapply(recs, extractRecord)

# bind elements do a data table
dt  = rbindlist(lst, use.names = T, fill = T); dt

>      name        address             hobbies
> 1:  John   2nd Av  1036   tennis  gardening 
> 2:  Mary    55th St 132             running

This works like a charm, except that I would like to have:

two columns for the "address", one for each subelement of -- say, address.street and address.number.
two columns for hobbies -- say hobby1 and hobby2. Important to note that the number of subchildren of is allowed to vary.

In the end, I'd have something like

I would aslo like to stick to the xml2 package if possible (because I have many, large-ish files and the XML package has known memory issues that become a problem in my case).

Upvotes: 2

Answers (2)

Parfait

Reputation: 107687

Consider XSLT special-purpose, declarative language (same type as SQL) designed to transform XML files such as flattening your original input. In R, XSLT can be run with the sister package to xml2: xslt. And since it is an industry language, it can be run with other general-purpose languages (i.e., Java, Python), CLIs (Bash, PowerShell), or executables (Saxon, Xalan) which R can call at command line using system().

library(xslt)

xsl <- '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:value-of select="normalize-space()"/>
  </xsl:template>

  <xsl:template match="/*/*">
    <xsl:copy>
      <xsl:copy-of select="*[not(*)]"/>
      <xsl:apply-templates select="*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="/*/*/*">
      <xsl:apply-templates select="*"/>
  </xsl:template>

</xsl:stylesheet>'

See: Online demo

Similar process as before but with transformation step to create new_pg:

# read in
pg <- read_xml(doc)
style <- read_xml(xsl, package = "xslt")

# transform original
new_pg <- xml_xslt(pg, style)

# make a list of records
recs <- xml_find_all(new_pg, "//rec")

# function to loop over list
extractRecord <- function(x) {      
  txt <- setNames(xml_text(xml_children(x)), 
                  xml_name(xml_children(x))
         )

  dt <- setDT(as.list(txt))[]
  return(dt)
}

# loop over list of records
lst <- lapply(recs, extractRecord)

# bind elements do a data table
dt <- rbindlist(lst, use.names = TRUE, fill = TRUE)
dt
#     name   street number     hobby       hobby
# 1:  John   2nd Av   1036    tennis   gardening 
# 2:  Mary  55th St    132   running        <NA>

To avoid repeating columns (i.e., hobby) add this template at the end of XSLT (before closing </xsl:stylesheet>) where you can pipe delimit any other columns you know in advance will have repeating columns:

  <!-- PIPE DELIMIT ANY REPEAT NAMED COLS IN TEMPLATE MATCH-->
  <xsl:template match="hobby|anothernode|othernode|stillothernode">
    <xsl:variable name="num" select="concat(name(), count(preceding-sibling::*)+1)"/>
    <xsl:element name="{$num}">
      <xsl:value-of select="normalize-space()"/>
    </xsl:element>  
  </xsl:template>

Upvotes: 1

JasonAizkalns

Reputation: 20473

So if you're open to using the tidyverse here's an approach. First change your function to extract all the data for an individual record:

library(tidyverse)

get_elements <- function(rec) {

  name = xml_find_all(rec, "name") %>% xml_text

  hobbies = xml_find_all(rec, "hobbies")
  hobby_list = hobbies %>% xml_find_all("hobby") %>% xml_text

  address = xml_find_all(rec, "address")
  street = address %>% xml_find_all("street") %>% xml_text
  street_num = address %>% xml_find_all("number") %>% xml_text

  df = tibble(
    name = str_squish(name),
    street = str_squish(street), 
    street_num = str_squish(street_num),
    hobbies = str_squish(hobby_list)
  )
  return(df)
}

So now, for any given record (e.g. recs[1], recs[2]) we return a table:

get_elements(recs[1])
#> # A tibble: 2 x 4
#>   name  street street_num hobbies  
#>   <chr> <chr>  <chr>      <chr>    
#> 1 John  2nd Av 1036       tennis   
#> 2 John  2nd Av 1036       gardening
get_elements(recs[2])
#> # A tibble: 1 x 4
#>   name  street  street_num hobbies
#>   <chr> <chr>   <chr>      <chr>  
#> 1 Mary  55th St 132        running

Then combined those tables using your favorite method:

res_df <- 
bind_rows(
  get_elements(recs[1]),
  get_elements(recs[2])
)

# More tidyverse/purrr-like:
res_df <- 
  recs %>%
  map_df(get_elements)

res_df
#> # A tibble: 3 x 4
#>   name  street  street_num hobbies  
#>   <chr> <chr>   <chr>      <chr>    
#> 1 John  2nd Av  1036       tennis   
#> 2 John  2nd Av  1036       gardening
#> 3 Mary  55th St 132        running

Finally, do some data wrangling to pivot the data into the desired final format:

res_df %>% 
  group_by(name) %>%
  mutate(
    hobby_idx = paste0("hobby", row_number())
  ) %>%
  pivot_wider(
    names_from = hobby_idx,
    values_from = hobbies
  )
#> # A tibble: 2 x 5
#> # Groups:   name [2]
#>   name  street  street_num hobby1  hobby2   
#>   <chr> <chr>   <chr>      <chr>   <chr>    
#> 1 John  2nd Av  1036       tennis  gardening
#> 2 Mary  55th St 132        running <NA>

Upvotes: 0

XML with varying number of subchildren to data frame/table

Answers (2)

Related Questions