Reputation: 33
I am having trouble with unnest_wider (from tidyr). I have this nested XML document that I am trying to convert into a dataframe/tibble. I followed the workflow presented here, which proposes to turn the XML nodeset into R lists.
My XML is OAI/Dublin Core formatted, and I have several elements in it that have the same name ("subject.other", for example). Simplified, my doc.xml
looks like this :
<?xml version="1.0" encoding="utf-8"?>
<ListRecords>
<record>
<header>
<identifier>id_01</identifier>
<datestamp>2024-05</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Fake Editions</dc:publisher>
<dc:subject.other>Great subject n°1</dc:subject.other>
<dc:subject.other>Great subject n°2</dc:subject.other>
<dc:subject.other>Great subject n°3</dc:subject.other>
<dc:subject.other>Great subject n°4</dc:subject.other>
<dc:subject.other>Great subject n°5</dc:subject.other>
<dc:subject.other>Great subject n°6</dc:subject.other>
<dc:title>Random title</dc:title>
</oai_dc:dc>
</metadata>
</record>
</ListRecords>
What I tried
The code that I ran is the following :
# doc.xml is turned into a list
doc_list <- xmlconvert::xml_to_list(read_xml("doc.xml"))
# the list becomes a tibble
df <- tibble::enframe(doc_list)
# unnesting the column "value", where we find the listed elements contained in <header> and <metadata> in the XML
final_df <- df %>%
unnest_wider(value, names_repair = "universal")
with doc_list
being :
list(header = list(identifier = "id_01", datestamp = "2024-05"),
metadata = structure(list(publisher = "Fake Editions", subject.other = "Great subject n°1",
subject.other = "Great subject n°2", subject.other = "Great subject n°3",
subject.other = "Great subject n°4", subject.other = "Great subject n°5",
subject.other = "Great subject n°6", title = "Random title"), schemaLocation = "http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd", "`xmlns:dc`" = "http://purl.org/dc/elements/1.1/", "`xmlns:oai_dc`" = "http://www.openarchives.org/OAI/2.0/oai_dc/", "`xmlns:xsi`" = "http://www.w3.org/2001/XMLSchema-instance"))
Expectations...
What I would like my final_df
to look like in the end is something like that :
structure(list(
identifier = "id_01",
publisher = "Fake Editions",
subject.other_1 = "Great subject n°1",
subject.other_2 = "Great subject n°2",
subject.other_3 = "Great subject n°3"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
...reality
But what I get is :
structure(list(
identifier = "id_01",
publisher = "Fake Editions",
subject.other_1 = "Great subject n°1",
subject.other_2 = "Great subject n°1",
subject.other_3 = "Great subject n°1"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
As you can see, what happens is that the actual values contained in the different "subject.other" elements get erased and replaced by the value contained in the first one ("Great subject n°1"). I tried changing the .names_repair
options, but it didn't change anything.
Would you see any solution to make it work ? I have tried everything to get this XML into a dataframe/tibble, and I am losing hope ! Thank you very much !
(I can provide more code/details, sorry I am not used to asking questions on Stackoverflow)
Edit : with several records :
<?xml version="1.0" encoding="utf-8"?>
<ListRecords>
<record>
<header>
<identifier>id_01</identifier>
<datestamp>2024-05</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Fake Editions</dc:publisher>
<dc:subject.other>Great subject n°1</dc:subject.other>
<dc:subject.other>Great subject n°2</dc:subject.other>
<dc:subject.other>Great subject n°3</dc:subject.other>
<dc:subject.other>Great subject n°4</dc:subject.other>
<dc:subject.other>Great subject n°5</dc:subject.other>
<dc:subject.other>Great subject n°6</dc:subject.other>
<dc:title>Random title</dc:title>
</oai_dc:dc>
</metadata>
</record>
<record>
<header>
<identifier>id_02</identifier>
<datestamp>2023-05</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Fake Editions</dc:publisher>
<dc:subject.other>Great subject n°7</dc:subject.other>
<dc:subject.other>Great subject n°8</dc:subject.other>
<dc:subject.other>Great subject n°9</dc:subject.other>
<dc:subject.other>Great subject n°10</dc:subject.other>
<dc:subject.other>Great subject n°11</dc:subject.other>
<dc:subject.other>Great subject n°12</dc:subject.other>
<dc:title>Random title</dc:title>
</oai_dc:dc>
</metadata>
</record>
<record>
<header>
<identifier>id_03</identifier>
<datestamp>2020-05</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Fake Editions</dc:publisher>
<dc:subject.other>Great subject X</dc:subject.other>
<dc:subject.other>Great subject Y</dc:subject.other>
<dc:subject.other>Great subject Z</dc:subject.other>
<dc:subject.other>Great subject A</dc:subject.other>
<dc:subject.other>Great subject B</dc:subject.other>
<dc:subject.other>Great subject C</dc:subject.other>
<dc:title>Random title</dc:title>
</oai_dc:dc>
</metadata>
</record>
<record>
<header>
<identifier>id_04</identifier>
<datestamp>1990-05</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Fake Editions</dc:publisher>
<dc:subject.other>Great subject n°1</dc:subject.other>
<dc:subject.other>Great subject n°2</dc:subject.other>
<dc:subject.other>Great subject n°3</dc:subject.other>
<dc:subject.other>Great subject n°4</dc:subject.other>
<dc:subject.other>Great subject n°5</dc:subject.other>
<dc:subject.other>Great subject n°6</dc:subject.other>
<dc:title>Random title</dc:title>
</oai_dc:dc>
</metadata>
</record>
</ListRecords>
Upvotes: 0
Views: 63
Reputation: 33
I've been trying many things and this seems to work (though I am avoiding the problem) :
df <- doc_list %>%
transpose() %>%
as_tibble(.name_repair = "unique") %>%
unnest_longer(metadata, names_repair = "unique") %>%
unnest_wider(header)
For a reason that I ignore, unnest_longer preserves all the data when unnest_wider erases the following occurences (maybe because the element name does not become a column name ?). Now I will just pivot_wider my tibble... but I am a bit frustrated that I don't understand what's going on inside unnest_wider when you have many named lists with the same name.
Upvotes: 1
Reputation: 2132
See if it works with xml2::as_list
:
Toy data:
doc_list <- read_xml("<?xml version=\"1.0\" encoding=\"utf-8\"?><ListRecords><record><header><identifier>id_01</identifier><datestamp>2024-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°1</dc:subject.other><dc:subject.other>Great subject n°2</dc:subject.other><dc:subject.other>Great subject n°3</dc:subject.other><dc:subject.other>Great subject n°4</dc:subject.other><dc:subject.other>Great subject n°5</dc:subject.other><dc:subject.other>Great subject n°6</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record></ListRecords>")
The code:
doc_df <- doc_list %>%
xml2::as_list() %>%
unlist() %>%
as_tibble_row(.name_repair = "unique")
The output:
> glimpse(doc_df)
Rows: 1
Columns: 10
$ ListRecords.record.header.identifier <chr> "id_01"
$ ListRecords.record.header.datestamp <chr> "2024-05"
$ ListRecords.record.metadata.dc.publisher <chr> "Fake Editions"
$ ListRecords.record.metadata.dc.subject.other...4 <chr> "Great subject n°1"
$ ListRecords.record.metadata.dc.subject.other...5 <chr> "Great subject n°2"
$ ListRecords.record.metadata.dc.subject.other...6 <chr> "Great subject n°3"
$ ListRecords.record.metadata.dc.subject.other...7 <chr> "Great subject n°4"
$ ListRecords.record.metadata.dc.subject.other...8 <chr> "Great subject n°5"
$ ListRecords.record.metadata.dc.subject.other...9 <chr> "Great subject n°6"
$ ListRecords.record.metadata.dc.title <chr> "Random title"
After the comments, here's another approach with xml_find_all
+ purrr::map
, without xml2::as_list
over a bigger toy data (thanks):
Toy data:
# Pkgs
library(tidyverse)
library(xml2)
# Bigger toy data
toy_data <- "<?xml version=\"1.0\" encoding=\"utf-8\"?><ListRecords><record><header><identifier>id_01</identifier><datestamp>2024-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°1</dc:subject.other><dc:subject.other>Great subject n°2</dc:subject.other><dc:subject.other>Great subject n°3</dc:subject.other><dc:subject.other>Great subject n°4</dc:subject.other><dc:subject.other>Great subject n°5</dc:subject.other><dc:subject.other>Great subject n°6</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record><record><header><identifier>id_02</identifier><datestamp>2023-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°7</dc:subject.other><dc:subject.other>Great subject n°8</dc:subject.other><dc:subject.other>Great subject n°9</dc:subject.other><dc:subject.other>Great subject n°10</dc:subject.other><dc:subject.other>Great subject n°11</dc:subject.other><dc:subject.other>Great subject n°12</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record><record><header><identifier>id_03</identifier><datestamp>2020-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject X</dc:subject.other><dc:subject.other>Great subject Y</dc:subject.other><dc:subject.other>Great subject Z</dc:subject.other><dc:subject.other>Great subject A</dc:subject.other><dc:subject.other>Great subject B</dc:subject.other><dc:subject.other>Great subject C</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record><record><header><identifier>id_04</identifier><datestamp>1990-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°1</dc:subject.other><dc:subject.other>Great subject n°2</dc:subject.other><dc:subject.other>Great subject n°3</dc:subject.other><dc:subject.other>Great subject n°4</dc:subject.other><dc:subject.other>Great subject n°5</dc:subject.other><dc:subject.other>Great subject n°6</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record></ListRecords>"
toy_xml <- read_xml(toy_data)
The code:
# `xml_find_all` replaces `as_list`
# `purrr::map_dfr` building a tibble instead of `as_tibble_`
# You can inspect your xml with `xml_structure`
toy_df <- map_dfr(
xml_find_all(toy_xml, ".//record"),
\(x_record) tibble(
identifier = xml_text(xml_find_first(x_record, ".//identifier")),
datestamp = xml_text(xml_find_first(x_record, ".//datestamp")),
publisher = xml_text(xml_find_first(x_record, ".//dc:publisher")),
subjects = xml_text(xml_find_all( x_record, ".//dc:subject.other")),
title = xml_text(xml_find_first(x_record, ".//dc:title"))))
The output:
> toy_df %>% print(n = nrow(.))
# A tibble: 24 × 5
identifier datestamp publisher subjects title
<chr> <chr> <chr> <chr> <chr>
1 id_01 2024-05 Fake Editions Great subject n°1 Random title
2 id_01 2024-05 Fake Editions Great subject n°2 Random title
3 id_01 2024-05 Fake Editions Great subject n°3 Random title
4 id_01 2024-05 Fake Editions Great subject n°4 Random title
5 id_01 2024-05 Fake Editions Great subject n°5 Random title
6 id_01 2024-05 Fake Editions Great subject n°6 Random title
7 id_02 2023-05 Fake Editions Great subject n°7 Random title
8 id_02 2023-05 Fake Editions Great subject n°8 Random title
9 id_02 2023-05 Fake Editions Great subject n°9 Random title
10 id_02 2023-05 Fake Editions Great subject n°10 Random title
11 id_02 2023-05 Fake Editions Great subject n°11 Random title
12 id_02 2023-05 Fake Editions Great subject n°12 Random title
13 id_03 2020-05 Fake Editions Great subject X Random title
14 id_03 2020-05 Fake Editions Great subject Y Random title
15 id_03 2020-05 Fake Editions Great subject Z Random title
16 id_03 2020-05 Fake Editions Great subject A Random title
17 id_03 2020-05 Fake Editions Great subject B Random title
18 id_03 2020-05 Fake Editions Great subject C Random title
19 id_04 1990-05 Fake Editions Great subject n°1 Random title
20 id_04 1990-05 Fake Editions Great subject n°2 Random title
21 id_04 1990-05 Fake Editions Great subject n°3 Random title
22 id_04 1990-05 Fake Editions Great subject n°4 Random title
23 id_04 1990-05 Fake Editions Great subject n°5 Random title
24 id_04 1990-05 Fake Editions Great subject n°6 Random title
Hope we get it right now!
Created on 2024-05-07 with reprex v2.1.0
Upvotes: 0