Reputation: 33

Unnesting identically named lists from tibble : data gets erased

I am having trouble with unnest_wider (from tidyr). I have this nested XML document that I am trying to convert into a dataframe/tibble. I followed the workflow presented here, which proposes to turn the XML nodeset into R lists.

My XML is OAI/Dublin Core formatted, and I have several elements in it that have the same name ("subject.other", for example). Simplified, my doc.xml looks like this :

<?xml version="1.0" encoding="utf-8"?>
<ListRecords>
  <record>
    <header>
      <identifier>id_01</identifier>
      <datestamp>2024-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject n°1</dc:subject.other>
        <dc:subject.other>Great subject n°2</dc:subject.other>
        <dc:subject.other>Great subject n°3</dc:subject.other>
        <dc:subject.other>Great subject n°4</dc:subject.other>
        <dc:subject.other>Great subject n°5</dc:subject.other>
        <dc:subject.other>Great subject n°6</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
</ListRecords>

What I tried

The code that I ran is the following :

# doc.xml is turned into a list
doc_list <- xmlconvert::xml_to_list(read_xml("doc.xml")) 

# the list becomes a tibble
df <- tibble::enframe(doc_list) 

# unnesting the column "value", where we find the listed elements contained in <header> and <metadata> in the XML
final_df <- df %>%
  unnest_wider(value, names_repair = "universal")

with doc_list being :

list(header = list(identifier = "id_01", datestamp = "2024-05"), 
    metadata = structure(list(publisher = "Fake Editions", subject.other = "Great subject n°1", 
        subject.other = "Great subject n°2", subject.other = "Great subject n°3", 
        subject.other = "Great subject n°4", subject.other = "Great subject n°5", 
        subject.other = "Great subject n°6", title = "Random title"), schemaLocation = "http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd", "`xmlns:dc`" = "http://purl.org/dc/elements/1.1/", "`xmlns:oai_dc`" = "http://www.openarchives.org/OAI/2.0/oai_dc/", "`xmlns:xsi`" = "http://www.w3.org/2001/XMLSchema-instance"))

Expectations...

What I would like my final_df to look like in the end is something like that :

structure(list(
  identifier = "id_01",
  publisher = "Fake Editions", 
  subject.other_1 = "Great subject n°1",
  subject.other_2 = "Great subject n°2", 
  subject.other_3 = "Great subject n°3"), 
  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

...reality

But what I get is :

structure(list(
  identifier = "id_01",
  publisher = "Fake Editions", 
  subject.other_1 = "Great subject n°1",
  subject.other_2 = "Great subject n°1", 
  subject.other_3 = "Great subject n°1"), 
  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

As you can see, what happens is that the actual values contained in the different "subject.other" elements get erased and replaced by the value contained in the first one ("Great subject n°1"). I tried changing the .names_repair options, but it didn't change anything.

Would you see any solution to make it work ? I have tried everything to get this XML into a dataframe/tibble, and I am losing hope ! Thank you very much !

(I can provide more code/details, sorry I am not used to asking questions on Stackoverflow)

Edit : with several records :

<?xml version="1.0" encoding="utf-8"?>
<ListRecords>
  <record>
    <header>
      <identifier>id_01</identifier>
      <datestamp>2024-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject n°1</dc:subject.other>
        <dc:subject.other>Great subject n°2</dc:subject.other>
        <dc:subject.other>Great subject n°3</dc:subject.other>
        <dc:subject.other>Great subject n°4</dc:subject.other>
        <dc:subject.other>Great subject n°5</dc:subject.other>
        <dc:subject.other>Great subject n°6</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
  <record>
    <header>
      <identifier>id_02</identifier>
      <datestamp>2023-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject n°7</dc:subject.other>
        <dc:subject.other>Great subject n°8</dc:subject.other>
        <dc:subject.other>Great subject n°9</dc:subject.other>
        <dc:subject.other>Great subject n°10</dc:subject.other>
        <dc:subject.other>Great subject n°11</dc:subject.other>
        <dc:subject.other>Great subject n°12</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
  <record>
    <header>
      <identifier>id_03</identifier>
      <datestamp>2020-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject X</dc:subject.other>
        <dc:subject.other>Great subject Y</dc:subject.other>
        <dc:subject.other>Great subject Z</dc:subject.other>
        <dc:subject.other>Great subject A</dc:subject.other>
        <dc:subject.other>Great subject B</dc:subject.other>
        <dc:subject.other>Great subject C</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
  <record>
    <header>
      <identifier>id_04</identifier>
      <datestamp>1990-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject n°1</dc:subject.other>
        <dc:subject.other>Great subject n°2</dc:subject.other>
        <dc:subject.other>Great subject n°3</dc:subject.other>
        <dc:subject.other>Great subject n°4</dc:subject.other>
        <dc:subject.other>Great subject n°5</dc:subject.other>
        <dc:subject.other>Great subject n°6</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
</ListRecords>

Upvotes: 0

Answers (2)

PierreGeorge

Reputation: 33

I've been trying many things and this seems to work (though I am avoiding the problem) :

df <- doc_list %>% 
  transpose() %>%
  as_tibble(.name_repair = "unique") %>%
  unnest_longer(metadata, names_repair = "unique") %>%
  unnest_wider(header)

For a reason that I ignore, unnest_longer preserves all the data when unnest_wider erases the following occurences (maybe because the element name does not become a column name ?). Now I will just pivot_wider my tibble... but I am a bit frustrated that I don't understand what's going on inside unnest_wider when you have many named lists with the same name.

Upvotes: 1

Adriano Mello

Reputation: 2132

See if it works with xml2::as_list:

Toy data:

doc_list <- read_xml("<?xml version=\"1.0\" encoding=\"utf-8\"?><ListRecords><record><header><identifier>id_01</identifier><datestamp>2024-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°1</dc:subject.other><dc:subject.other>Great subject n°2</dc:subject.other><dc:subject.other>Great subject n°3</dc:subject.other><dc:subject.other>Great subject n°4</dc:subject.other><dc:subject.other>Great subject n°5</dc:subject.other><dc:subject.other>Great subject n°6</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record></ListRecords>")

The code:

doc_df <- doc_list %>% 
  xml2::as_list() %>% 
  unlist() %>% 
  as_tibble_row(.name_repair = "unique")

The output:

> glimpse(doc_df)
Rows: 1
Columns: 10
$ ListRecords.record.header.identifier             <chr> "id_01"
$ ListRecords.record.header.datestamp              <chr> "2024-05"
$ ListRecords.record.metadata.dc.publisher         <chr> "Fake Editions"
$ ListRecords.record.metadata.dc.subject.other...4 <chr> "Great subject n°1"
$ ListRecords.record.metadata.dc.subject.other...5 <chr> "Great subject n°2"
$ ListRecords.record.metadata.dc.subject.other...6 <chr> "Great subject n°3"
$ ListRecords.record.metadata.dc.subject.other...7 <chr> "Great subject n°4"
$ ListRecords.record.metadata.dc.subject.other...8 <chr> "Great subject n°5"
$ ListRecords.record.metadata.dc.subject.other...9 <chr> "Great subject n°6"
$ ListRecords.record.metadata.dc.title             <chr> "Random title"

Uptade

After the comments, here's another approach with xml_find_all + purrr::map, without xml2::as_list over a bigger toy data (thanks):