Reputation: 1023
I am trying to transform data from XML to tabular form. I am struggling with elements with subchildren. Here's an example:
library(xml2)
library(data.table)
doc =
"<doc>
<rec>
<name> John </name>
<address>
<street> 2nd Av </street>
<number> 1036 </number>
</address>
<hobbies>
<hobby> tennis </hobby>
<hobby> gardening </hobby>
</hobbies>
</rec>
<rec>
<name> Mary </name>
<address>
<street>55th St</street>
<number> 132 </number>
</address>
<hobbies>
<hobby> running </hobby>
</hobbies>
</rec>
</doc>
"
# read in
pg <- read_xml(doc)
# make a list of records
recs = xml_find_all(pg, "//rec", xml_ns(pg))
# function to loop over list
extractRecord = function(x) {
txt = xml_text(xml_children(x))
name = xml_name(xml_children(x))
names(txt) = name
dt = setDT(as.list(txt))[]
return(dt)
}
# loop over list of records
lst = lapply(recs, extractRecord)
# bind elements do a data table
dt = rbindlist(lst, use.names = T, fill = T); dt
> name address hobbies
> 1: John 2nd Av 1036 tennis gardening
> 2: Mary 55th St 132 running
This works like a charm, except that I would like to have:
In the end, I'd have something like
I would aslo like to stick to the xml2
package if possible (because I have many, large-ish files and the XML
package has known memory issues that become a problem in my case).
Upvotes: 2
Views: 208
Reputation: 107687
Consider XSLT special-purpose, declarative language (same type as SQL) designed to transform XML files such as flattening your original input. In R, XSLT can be run with the sister package to xml2
: xslt
. And since it is an industry language, it can be run with other general-purpose languages (i.e., Java, Python), CLIs (Bash, PowerShell), or executables (Saxon, Xalan) which R can call at command line using system()
.
library(xslt)
xsl <- '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()">
<xsl:value-of select="normalize-space()"/>
</xsl:template>
<xsl:template match="/*/*">
<xsl:copy>
<xsl:copy-of select="*[not(*)]"/>
<xsl:apply-templates select="*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="/*/*/*">
<xsl:apply-templates select="*"/>
</xsl:template>
</xsl:stylesheet>'
See: Online demo
Similar process as before but with transformation step to create new_pg:
# read in
pg <- read_xml(doc)
style <- read_xml(xsl, package = "xslt")
# transform original
new_pg <- xml_xslt(pg, style)
# make a list of records
recs <- xml_find_all(new_pg, "//rec")
# function to loop over list
extractRecord <- function(x) {
txt <- setNames(xml_text(xml_children(x)),
xml_name(xml_children(x))
)
dt <- setDT(as.list(txt))[]
return(dt)
}
# loop over list of records
lst <- lapply(recs, extractRecord)
# bind elements do a data table
dt <- rbindlist(lst, use.names = TRUE, fill = TRUE)
dt
# name street number hobby hobby
# 1: John 2nd Av 1036 tennis gardening
# 2: Mary 55th St 132 running <NA>
To avoid repeating columns (i.e., hobby) add this template at the end of XSLT (before closing </xsl:stylesheet>
) where you can pipe delimit any other columns you know in advance will have repeating columns:
<!-- PIPE DELIMIT ANY REPEAT NAMED COLS IN TEMPLATE MATCH-->
<xsl:template match="hobby|anothernode|othernode|stillothernode">
<xsl:variable name="num" select="concat(name(), count(preceding-sibling::*)+1)"/>
<xsl:element name="{$num}">
<xsl:value-of select="normalize-space()"/>
</xsl:element>
</xsl:template>
Upvotes: 1
Reputation: 20473
So if you're open to using the tidyverse
here's an approach. First change your function to extract all the data for an individual record:
library(tidyverse)
get_elements <- function(rec) {
name = xml_find_all(rec, "name") %>% xml_text
hobbies = xml_find_all(rec, "hobbies")
hobby_list = hobbies %>% xml_find_all("hobby") %>% xml_text
address = xml_find_all(rec, "address")
street = address %>% xml_find_all("street") %>% xml_text
street_num = address %>% xml_find_all("number") %>% xml_text
df = tibble(
name = str_squish(name),
street = str_squish(street),
street_num = str_squish(street_num),
hobbies = str_squish(hobby_list)
)
return(df)
}
So now, for any given record (e.g. recs[1]
, recs[2]
) we return a table:
get_elements(recs[1])
#> # A tibble: 2 x 4
#> name street street_num hobbies
#> <chr> <chr> <chr> <chr>
#> 1 John 2nd Av 1036 tennis
#> 2 John 2nd Av 1036 gardening
get_elements(recs[2])
#> # A tibble: 1 x 4
#> name street street_num hobbies
#> <chr> <chr> <chr> <chr>
#> 1 Mary 55th St 132 running
Then combined those tables using your favorite method:
res_df <-
bind_rows(
get_elements(recs[1]),
get_elements(recs[2])
)
# More tidyverse/purrr-like:
res_df <-
recs %>%
map_df(get_elements)
res_df
#> # A tibble: 3 x 4
#> name street street_num hobbies
#> <chr> <chr> <chr> <chr>
#> 1 John 2nd Av 1036 tennis
#> 2 John 2nd Av 1036 gardening
#> 3 Mary 55th St 132 running
Finally, do some data wrangling to pivot the data into the desired final format:
res_df %>%
group_by(name) %>%
mutate(
hobby_idx = paste0("hobby", row_number())
) %>%
pivot_wider(
names_from = hobby_idx,
values_from = hobbies
)
#> # A tibble: 2 x 5
#> # Groups: name [2]
#> name street street_num hobby1 hobby2
#> <chr> <chr> <chr> <chr> <chr>
#> 1 John 2nd Av 1036 tennis gardening
#> 2 Mary 55th St 132 running <NA>
Upvotes: 0