Extracting nodes by name

Question

I am trying to parse an XML file with xml2, but I cannot for the life of me figure out how to select nodes by name.

This works:

library(xml2)
library(dplyr)

xml <- read_xml(file)

-->

> xml
{xml_document}

[1] 
  
    15181
    30052 ...
[3] 73363063
[4] b8f69d6276d9c4929e74416bc9e3446a173d1894

And I can extract by position both step-by-step and with xpath:

xml_child(xml, 1) %>% xml_child(7) %>% xml_attr("startTimeStamp")

and

xml_child(xml, "/*[1]/*[7]") %>% xml_attr("startTimeStamp")

However, my attempts to select by name are failing.

> xml_child(xml, "indexedmzML")
{xml_missing}

> xml_child(xml, "mzML")
{xml_missing}

and

> xml_child(xml, "/indexedmzML")
{xml_missing}

> xml_child(xml, "/mzML")
{xml_missing}

and

> xml_child(xml, "/mzML/run")
{xml_missing}

Can someone point me to the solution that is escaping me?

EDIT:

OK, here is a data example. With that data, what I want is

xml_child(xml, 1) %>% xml_child(2) %>% xml_attr("startTimeStamp")

But selected by name.

fmic_ · Accepted Answer

If you want to extract all the startTimeStamp values from your XML file, you can do:

xml %>% xml_find_all("//@startTimeStamp") %>% xml_text()
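To make this easy to try without the original file, here is a minimal sketch on a made-up document. The element names and the namespace URI mirror the question, but the content (the run id and the timestamp value) is invented for illustration. Attributes without a prefix are not in any namespace, which is why this XPath needs no prefix even though the elements are namespaced.

```r
library(xml2)

# Hypothetical stand-in for the real mzML file: same element names and
# default namespace, made-up attribute values.
doc <- read_xml(paste0(
  '<indexedmzML xmlns="http://psi.hupo.org/ms/mzml">',
  '<mzML>',
  '<run id="run1" startTimeStamp="2020-01-01T12:00:00Z"/>',
  '</mzML>',
  '</indexedmzML>'
))

# "//@startTimeStamp" matches every startTimeStamp attribute anywhere in
# the document; xml_text() returns the attribute values as a character vector.
stamps <- xml_text(xml_find_all(doc, "//@startTimeStamp"))
print(stamps)
```

If the file had several elements carrying a startTimeStamp attribute, this would return all of them, so it is only a shortcut when you do not care which element the value comes from.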

EDIT:

If you want to select it by name, then you need to worry about namespaces.

Indeed,

xml %>% xml_child("mzML")

will return

{xml_missing}

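This happens because the elements live in a default namespace, and an unprefixed name in an XPath expression only matches elements that are in no namespace. You first need to check the namespaces associated with your XML file: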

xml_ns(xml)
# d1   <-> http://psi.hupo.org/ms/mzml
# d2   <-> http://psi.hupo.org/ms/mzml
# xsi  <-> http://www.w3.org/2001/XMLSchema-instance
# xsi1 <-> http://www.w3.org/2001/XMLSchema-instance

So you need to qualify the name with the d1 prefix:

xml %>% xml_child("d1:mzML")

For the full path to the attribute you're interested in:

xml %>% xml_child("d1:mzML") %>% xml_child("d1:run") %>% xml_attr("startTimeStamp")
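The same lookup can also be written as a single XPath expression instead of a chain of xml_child() calls. The sketch below runs on a made-up document (the content is invented, only the element names and namespace URI follow the question) and assumes the default namespace gets the d1 prefix from xml_ns(), as it does in the output above.

```r
library(xml2)

# Hypothetical stand-in for the real mzML file.
doc <- read_xml(paste0(
  '<indexedmzML xmlns="http://psi.hupo.org/ms/mzml">',
  '<mzML>',
  '<run id="run1" startTimeStamp="2020-01-01T12:00:00Z"/>',
  '</mzML>',
  '</indexedmzML>'
))

# Without a prefix the path matches nothing, as in the question.
missing_run <- xml_find_first(doc, "/indexedmzML/mzML/run")

# With the d1 prefix on every step the node is found.
# xml_find_first() uses xml_ns(doc) as its default namespace map.
run <- xml_find_first(doc, "/d1:indexedmzML/d1:mzML/d1:run")
stamp <- xml_attr(run, "startTimeStamp")
print(stamp)
```

Note that every step of the path needs the prefix, not just the first one, because each element inherits the default namespace from the root.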

The documentation for xml_ns() gives a little more information and encourages you to rename your namespace prefixes to something more informative, which you can do with xml_ns_rename().
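As a rough illustration of the renaming, the sketch below swaps the generated d1 prefix for mz and passes the renamed map explicitly. The prefix mz and the sample document are my own choices for the example and do not come from the question.

```r
library(xml2)

# Hypothetical stand-in for the real mzML file.
doc <- read_xml(paste0(
  '<indexedmzML xmlns="http://psi.hupo.org/ms/mzml">',
  '<mzML>',
  '<run id="run1" startTimeStamp="2020-01-01T12:00:00Z"/>',
  '</mzML>',
  '</indexedmzML>'
))

# Rename the auto-generated prefix d1 to the (arbitrary) prefix mz.
ns <- xml_ns_rename(xml_ns(doc), d1 = "mz")

# The renamed map has to be passed explicitly; otherwise the functions
# fall back to xml_ns(doc), which still uses d1.
run <- xml_child(xml_child(doc, "mz:mzML", ns), "mz:run", ns)
stamp <- xml_attr(run, "startTimeStamp")
print(stamp)
```

The renamed map only affects how you write the queries; the document itself is unchanged.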
