Jan Stanstrup

Reputation: 1232

Extracting nodes by name

I am trying to parse an XML file with xml2, but I cannot for the life of me figure out how to select nodes by name.

This works:

library(xml2)
library(dplyr)

xml <- read_xml(file)

which gives:

> xml
{xml_document}
<indexedmzML schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.2_idx.xsd" xmlns="http://psi.hupo.org/ms/mzml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
[1] <mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://psi.hupo.org/ms/mzml  ...
[2] <indexList count="2">\n  <index name="spectrum">\n    <offset idRef="scanId=3027">15181</offset>\n    <offset idRef="scanId=3524">30052</offset> ...
[3] <indexListOffset>73363063</indexListOffset>
[4] <fileChecksum>b8f69d6276d9c4929e74416bc9e3446a173d1894</fileChecksum>

I can also extract by position, both step by step and with XPath:

xml_child(xml, 1) %>% xml_child(7) %>% xml_attr("startTimeStamp")

or

xml_child(xml, "/*[1]/*[7]") %>% xml_attr("startTimeStamp")



However, my attempts to select by name are failing.

> xml_child(xml, "indexedmzML")
{xml_missing}
<NA>
> xml_child(xml, "mzML")
{xml_missing}
<NA>

and

> xml_child(xml, "/indexedmzML")
{xml_missing}
<NA>
> xml_child(xml, "/mzML")
{xml_missing}
<NA>

and

> xml_child(xml, "/mzML/run")
{xml_missing}
<NA>



Can someone point me to the solution that is somehow escaping me?



EDIT:

OK, here is a data example. With that data, what I want is

xml_child(xml, 1) %>% xml_child(2) %>% xml_attr("startTimeStamp")

But selected by name.

<?xml version="1.0" encoding="utf-8"?>
<indexedmzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.2_idx.xsd">
  <mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" id="0001_LIP1p_20150803_008_CHCl3-MeOH_1_1" version="1.1.0">

      <dataProcessing id="pwiz_Reader_Agilent_conversion">
        <processingMethod order="0" softwareRef="pwiz">
          <cvParam cvRef="MS" accession="MS:1000544" name="Conversion to mzML" value=""/>
        </processingMethod>
        <processingMethod order="1" softwareRef="pwiz">
          <cvParam cvRef="MS" accession="MS:1000035" name="peak picking" value=""/>
          <userParam name="Agilent/MassHunter peak picking"/>
        </processingMethod>
      </dataProcessing>

    <run id="_x0030_001_LIP1p_20150803_008_CHCl3-MeOH_1_1" defaultInstrumentConfigurationRef="IC1" startTimeStamp="2015-08-03T14:34:14Z" defaultSourceFileRef="MSScan.bin">

    </run>
  </mzML>
</indexedmzML>

Upvotes: 7

Views: 1855

Answers (1)

fmic_

Reputation: 2446

If you want to extract all the startTimeStamp values from your XML file, you can do:

xml %>% xml_find_all("//@startTimeStamp") %>% xml_text()

EDIT:

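If you want to select it by name, you need to take namespaces into account. The document declares a default namespace, so an unprefixed name such as mzML refers to an element in no namespace and matches nothing.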

Indeed,

xml %>% xml_child("mzML")

will return

{xml_missing}
<NA>

You first need to check the namespaces associated with your XML file:

xml_ns(xml)
# d1   <-> http://psi.hupo.org/ms/mzml
# d2   <-> http://psi.hupo.org/ms/mzml
# xsi  <-> http://www.w3.org/2001/XMLSchema-instance
# xsi1 <-> http://www.w3.org/2001/XMLSchema-instance

So you need to prefix the node name with d1:

xml %>% xml_child("d1:mzML")

For the full path to the attribute you're interested in:

xml %>% xml_child("d1:mzML") %>% xml_child("d1:run") %>% xml_attr("startTimeStamp")
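The same lookup can probably be written as a single XPath expression with xml_find_first(), passing the namespace map explicitly. This is only a sketch: the inline document is a trimmed-down stand-in for the file in the question, and the names doc, ns, run and stamp are made up for illustration.

```r
library(xml2)

# Trimmed-down stand-in for the mzML file from the question (assumed data)
doc <- read_xml('<indexedmzML xmlns="http://psi.hupo.org/ms/mzml">
  <mzML xmlns="http://psi.hupo.org/ms/mzml" version="1.1.0">
    <run id="r1" startTimeStamp="2015-08-03T14:34:14Z"/>
  </mzML>
</indexedmzML>')

# xml_ns() assigns the prefix d1 to the first default namespace it finds
ns <- xml_ns(doc)

# Every element step carries the d1 prefix; the attribute itself is
# unprefixed and therefore needs no namespace
run <- xml_find_first(doc, "/d1:indexedmzML/d1:mzML/d1:run", ns)
stamp <- xml_attr(run, "startTimeStamp")
stamp
```

Attributes without a prefix are not placed in the default namespace, which is why xml_attr("startTimeStamp") and "//@startTimeStamp" work without any prefix.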

The documentation for xml_ns() gives a little more information, and encourages you to rename the automatically generated prefixes (d1, d2, ...) to more informative names.
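As a rough illustration of that renaming, xml_ns_rename() takes the namespace map and old = "new" pairs. The prefix mz and the inline document below are assumptions chosen for the example, not something taken from the original file.

```r
library(xml2)

# Trimmed-down stand-in for the mzML file from the question (assumed data)
doc <- read_xml('<indexedmzML xmlns="http://psi.hupo.org/ms/mzml">
  <mzML xmlns="http://psi.hupo.org/ms/mzml" version="1.1.0">
    <run id="r1" startTimeStamp="2015-08-03T14:34:14Z"/>
  </mzML>
</indexedmzML>')

# Rename the generated prefix d1 to the (hypothetical) prefix mz
ns <- xml_ns_rename(xml_ns(doc), d1 = "mz")

# The renamed map must be passed explicitly, since the default
# ns = xml_ns(doc) only knows the generated prefixes
run <- xml_find_first(doc, "//mz:run", ns)
stamp <- xml_attr(run, "startTimeStamp")
stamp
```

The renamed prefix only exists in the ns object, so it has to be supplied to every xml_find_*() or xml_child() call that uses it.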

Upvotes: 8
