PierreGeorge
PierreGeorge

Reputation: 33

Unnesting identically named lists from tibble : data gets erased

I am having trouble with unnest_wider (from tidyr). I have this nested XML document that I am trying to convert into a dataframe/tibble. I followed the workflow presented here, which proposes to turn the XML nodeset into R lists.

My XML is OAI/Dublin Core formatted, and I have several elements in it that have the same name ("subject.other", for example). Simplified, my doc.xml looks like this :

<?xml version="1.0" encoding="utf-8"?>
<ListRecords>
  <record>
    <header>
      <identifier>id_01</identifier>
      <datestamp>2024-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject n°1</dc:subject.other>
        <dc:subject.other>Great subject n°2</dc:subject.other>
        <dc:subject.other>Great subject n°3</dc:subject.other>
        <dc:subject.other>Great subject n°4</dc:subject.other>
        <dc:subject.other>Great subject n°5</dc:subject.other>
        <dc:subject.other>Great subject n°6</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
</ListRecords>

What I tried

The code that I ran is the following :

# doc.xml is turned into a list
doc_list <- xmlconvert::xml_to_list(read_xml("doc.xml")) 

# the list becomes a tibble
df <- tibble::enframe(doc_list) 

# unnesting the column "value", where we find the listed elements contained in <header> and <metadata> in the XML
final_df <- df %>%
  unnest_wider(value, names_repair = "universal") 

with doc_list being :

list(header = list(identifier = "id_01", datestamp = "2024-05"), 
    metadata = structure(list(publisher = "Fake Editions", subject.other = "Great subject n°1", 
        subject.other = "Great subject n°2", subject.other = "Great subject n°3", 
        subject.other = "Great subject n°4", subject.other = "Great subject n°5", 
        subject.other = "Great subject n°6", title = "Random title"), schemaLocation = "http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd", "`xmlns:dc`" = "http://purl.org/dc/elements/1.1/", "`xmlns:oai_dc`" = "http://www.openarchives.org/OAI/2.0/oai_dc/", "`xmlns:xsi`" = "http://www.w3.org/2001/XMLSchema-instance"))

Expectations...

What I would like my final_df to look like in the end is something like that :

structure(list(
  identifier = "id_01",
  publisher = "Fake Editions", 
  subject.other_1 = "Great subject n°1",
  subject.other_2 = "Great subject n°2", 
  subject.other_3 = "Great subject n°3"), 
  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

...reality

But what I get is :

structure(list(
  identifier = "id_01",
  publisher = "Fake Editions", 
  subject.other_1 = "Great subject n°1",
  subject.other_2 = "Great subject n°1", 
  subject.other_3 = "Great subject n°1"), 
  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

As you can see, what happens is that the actual values contained in the different "subject.other" elements get erased and replaced by the value contained in the first one ("Great subject n°1"). I tried changing the .names_repair options, but it didn't change anything.

Would you see any solution to make it work ? I have tried everything to get this XML into a dataframe/tibble, and I am losing hope ! Thank you very much !

(I can provide more code/details, sorry I am not used to asking questions on Stackoverflow)

Edit : with several records :

<?xml version="1.0" encoding="utf-8"?>
<ListRecords>
  <record>
    <header>
      <identifier>id_01</identifier>
      <datestamp>2024-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject n°1</dc:subject.other>
        <dc:subject.other>Great subject n°2</dc:subject.other>
        <dc:subject.other>Great subject n°3</dc:subject.other>
        <dc:subject.other>Great subject n°4</dc:subject.other>
        <dc:subject.other>Great subject n°5</dc:subject.other>
        <dc:subject.other>Great subject n°6</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
  <record>
    <header>
      <identifier>id_02</identifier>
      <datestamp>2023-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject n°7</dc:subject.other>
        <dc:subject.other>Great subject n°8</dc:subject.other>
        <dc:subject.other>Great subject n°9</dc:subject.other>
        <dc:subject.other>Great subject n°10</dc:subject.other>
        <dc:subject.other>Great subject n°11</dc:subject.other>
        <dc:subject.other>Great subject n°12</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
  <record>
    <header>
      <identifier>id_03</identifier>
      <datestamp>2020-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject X</dc:subject.other>
        <dc:subject.other>Great subject Y</dc:subject.other>
        <dc:subject.other>Great subject Z</dc:subject.other>
        <dc:subject.other>Great subject A</dc:subject.other>
        <dc:subject.other>Great subject B</dc:subject.other>
        <dc:subject.other>Great subject C</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
  <record>
    <header>
      <identifier>id_04</identifier>
      <datestamp>1990-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject n°1</dc:subject.other>
        <dc:subject.other>Great subject n°2</dc:subject.other>
        <dc:subject.other>Great subject n°3</dc:subject.other>
        <dc:subject.other>Great subject n°4</dc:subject.other>
        <dc:subject.other>Great subject n°5</dc:subject.other>
        <dc:subject.other>Great subject n°6</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
</ListRecords>

Upvotes: 0

Views: 63

Answers (2)

PierreGeorge
PierreGeorge

Reputation: 33

I've been trying many things and this seems to work (though I am avoiding the problem) :

df <- doc_list %>% 
  transpose() %>%
  as_tibble(.name_repair = "unique") %>%
  unnest_longer(metadata, names_repair = "unique") %>%
  unnest_wider(header)

For a reason that I ignore, unnest_longer preserves all the data when unnest_wider erases the following occurences (maybe because the element name does not become a column name ?). Now I will just pivot_wider my tibble... but I am a bit frustrated that I don't understand what's going on inside unnest_wider when you have many named lists with the same name.

Upvotes: 1

Adriano Mello
Adriano Mello

Reputation: 2132

See if it works with xml2::as_list:

Toy data:

doc_list <- read_xml("<?xml version=\"1.0\" encoding=\"utf-8\"?><ListRecords><record><header><identifier>id_01</identifier><datestamp>2024-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°1</dc:subject.other><dc:subject.other>Great subject n°2</dc:subject.other><dc:subject.other>Great subject n°3</dc:subject.other><dc:subject.other>Great subject n°4</dc:subject.other><dc:subject.other>Great subject n°5</dc:subject.other><dc:subject.other>Great subject n°6</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record></ListRecords>")

The code:

doc_df <- doc_list %>% 
  xml2::as_list() %>% 
  unlist() %>% 
  as_tibble_row(.name_repair = "unique")

The output:

> glimpse(doc_df)
Rows: 1
Columns: 10
$ ListRecords.record.header.identifier             <chr> "id_01"
$ ListRecords.record.header.datestamp              <chr> "2024-05"
$ ListRecords.record.metadata.dc.publisher         <chr> "Fake Editions"
$ ListRecords.record.metadata.dc.subject.other...4 <chr> "Great subject n°1"
$ ListRecords.record.metadata.dc.subject.other...5 <chr> "Great subject n°2"
$ ListRecords.record.metadata.dc.subject.other...6 <chr> "Great subject n°3"
$ ListRecords.record.metadata.dc.subject.other...7 <chr> "Great subject n°4"
$ ListRecords.record.metadata.dc.subject.other...8 <chr> "Great subject n°5"
$ ListRecords.record.metadata.dc.subject.other...9 <chr> "Great subject n°6"
$ ListRecords.record.metadata.dc.title             <chr> "Random title"

Uptade

After the comments, here's another approach with xml_find_all + purrr::map, without xml2::as_list over a bigger toy data (thanks):

Toy data:

# Pkgs
library(tidyverse)
library(xml2)

# Bigger toy data
toy_data <- "<?xml version=\"1.0\" encoding=\"utf-8\"?><ListRecords><record><header><identifier>id_01</identifier><datestamp>2024-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°1</dc:subject.other><dc:subject.other>Great subject n°2</dc:subject.other><dc:subject.other>Great subject n°3</dc:subject.other><dc:subject.other>Great subject n°4</dc:subject.other><dc:subject.other>Great subject n°5</dc:subject.other><dc:subject.other>Great subject n°6</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record><record><header><identifier>id_02</identifier><datestamp>2023-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°7</dc:subject.other><dc:subject.other>Great subject n°8</dc:subject.other><dc:subject.other>Great subject n°9</dc:subject.other><dc:subject.other>Great subject n°10</dc:subject.other><dc:subject.other>Great subject n°11</dc:subject.other><dc:subject.other>Great subject n°12</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record><record><header><identifier>id_03</identifier><datestamp>2020-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject X</dc:subject.other><dc:subject.other>Great subject Y</dc:subject.other><dc:subject.other>Great subject Z</dc:subject.other><dc:subject.other>Great subject A</dc:subject.other><dc:subject.other>Great subject B</dc:subject.other><dc:subject.other>Great subject C</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record><record><header><identifier>id_04</identifier><datestamp>1990-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°1</dc:subject.other><dc:subject.other>Great subject n°2</dc:subject.other><dc:subject.other>Great subject n°3</dc:subject.other><dc:subject.other>Great subject n°4</dc:subject.other><dc:subject.other>Great subject n°5</dc:subject.other><dc:subject.other>Great subject n°6</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record></ListRecords>"
toy_xml  <- read_xml(toy_data)

The code:

# `xml_find_all` replaces `as_list`
# `purrr::map_dfr` building a tibble instead of `as_tibble_`
# You can inspect your xml with `xml_structure`
toy_df <- map_dfr(
  xml_find_all(toy_xml, ".//record"),
  
  \(x_record) tibble(
  identifier = xml_text(xml_find_first(x_record, ".//identifier")),
  datestamp  = xml_text(xml_find_first(x_record, ".//datestamp")),
  publisher  = xml_text(xml_find_first(x_record, ".//dc:publisher")),
  subjects   = xml_text(xml_find_all(  x_record, ".//dc:subject.other")),
  title      = xml_text(xml_find_first(x_record, ".//dc:title"))))

The output:

> toy_df %>% print(n = nrow(.))
# A tibble: 24 × 5
   identifier datestamp publisher     subjects           title       
   <chr>      <chr>     <chr>         <chr>              <chr>       
 1 id_01      2024-05   Fake Editions Great subject n°1  Random title
 2 id_01      2024-05   Fake Editions Great subject n°2  Random title
 3 id_01      2024-05   Fake Editions Great subject n°3  Random title
 4 id_01      2024-05   Fake Editions Great subject n°4  Random title
 5 id_01      2024-05   Fake Editions Great subject n°5  Random title
 6 id_01      2024-05   Fake Editions Great subject n°6  Random title
 7 id_02      2023-05   Fake Editions Great subject n°7  Random title
 8 id_02      2023-05   Fake Editions Great subject n°8  Random title
 9 id_02      2023-05   Fake Editions Great subject n°9  Random title
10 id_02      2023-05   Fake Editions Great subject n°10 Random title
11 id_02      2023-05   Fake Editions Great subject n°11 Random title
12 id_02      2023-05   Fake Editions Great subject n°12 Random title
13 id_03      2020-05   Fake Editions Great subject X    Random title
14 id_03      2020-05   Fake Editions Great subject Y    Random title
15 id_03      2020-05   Fake Editions Great subject Z    Random title
16 id_03      2020-05   Fake Editions Great subject A    Random title
17 id_03      2020-05   Fake Editions Great subject B    Random title
18 id_03      2020-05   Fake Editions Great subject C    Random title
19 id_04      1990-05   Fake Editions Great subject n°1  Random title
20 id_04      1990-05   Fake Editions Great subject n°2  Random title
21 id_04      1990-05   Fake Editions Great subject n°3  Random title
22 id_04      1990-05   Fake Editions Great subject n°4  Random title
23 id_04      1990-05   Fake Editions Great subject n°5  Random title
24 id_04      1990-05   Fake Editions Great subject n°6  Random title

Hope we get it right now!

Created on 2024-05-07 with reprex v2.1.0

Upvotes: 0

Related Questions