Matt Tracy

Reputation: 31

Converting XML document to R dataframe where subnodes all have the same name and use childnode for its identification

I want to convert the following XML document into a data frame so that I can access the coordinate information.

a3 <- xmlParse("test.xml")

I figure I will need some kind of apply function, but I'm not sure how to handle XML nodes that contain many same-named subnodes and are identified only by another node. <Marker_Type> nodes are identified by a child node (<Type>), and <Marker> nodes are identified by that same <Type> node as a sibling. Is this a straightforward problem? The examples I've seen on Stack Overflow have unique XML attributes at each level to identify the node, for example "Parsing XML with different number of subnode with same name in R" or "Extract second attribute of a xml node in R (XML package)".

I would like to make a dataframe like this:

| Filename | Type | Xaxis | Yaxis | Zaxis |
|----------|------|-------|-------|-------|
| File.tif | 1    | 7172  | 4332  | 1     |
| File.tif | 1    | 7170  | 4140  | 1     |
| File.tif | 2    | 6172  | 4332  | 1     |
| File.tif | 2    | 5173  | 4140  | 1     |

test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<CellCounter_Marker_File>
 <Image_Properties>
     <Image_Filename>File.tif</Image_Filename>
 </Image_Properties>
 <Marker_Data>
     <Marker_Type>
         <Type>1</Type>
         <Marker>
             <MarkerX>7172</MarkerX>
             <MarkerY>4332</MarkerY>
             <MarkerZ>1</MarkerZ>
         </Marker>
         <Marker>
             <MarkerX>7170</MarkerX>
             <MarkerY>4140</MarkerY>
             <MarkerZ>1</MarkerZ>
         </Marker>
     </Marker_Type>
     <Marker_Type>
         <Type>2</Type>
         <Marker>
             <MarkerX>6172</MarkerX>
             <MarkerY>4332</MarkerY>
             <MarkerZ>1</MarkerZ>
         </Marker>
         <Marker>
             <MarkerX>5173</MarkerX>
             <MarkerY>4140</MarkerY>
             <MarkerZ>1</MarkerZ>
         </Marker>
     </Marker_Type>
 </Marker_Data>
</CellCounter_Marker_File>

Upvotes: 3

Views: 1179

Answers (2)

hrbrmstr

Reputation: 78792

👍🏼 on a well-crafted second SO question!

Since you're using the XML package we'll keep you in that universe with this answer:

library(XML)

doc <- xmlParse("~/Data/sample.xml")

xpathApply(doc, "//Marker", function(.x) {

  tmp <- xmlToList(.x) # quick way to get the kids into a native R structure

  tmp$Type <- xmlValue(getNodeSet(.x, "../Type")[[1]]) # go back one and get the type

  # go to the root and get the filename (which we can and should do outside the apply)
  xmlValue(
    getNodeSet(.x, "//CellCounter_Marker_File/Image_Properties/Image_Filename")[[1]]
  ) -> tmp$Filename

  # change the names to protect the innocent  
  names(tmp) <- gsub("Marker([[:alpha:]])", "\\1axis", names(tmp))

  tmp

}) -> res

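# remember the current setting (default.stringsAsFactors() is deprecated in recent versions of R)
saf <- getOption("stringsAsFactors")
options("stringsAsFactors" = FALSE) # assuming you want strings vs factors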
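The comment inside the apply notes that the filename lookup can and should happen outside of it. A minimal sketch of that variant follows, assuming the same document layout as in the question. The names xml_txt, doc2, fname and res2 are made up for this illustration, and the inline XML is a shortened stand-in for test.xml.

```r
library(XML)

# Shortened stand-in for the question's test.xml (assumption: same layout)
xml_txt <- '<CellCounter_Marker_File>
 <Image_Properties><Image_Filename>File.tif</Image_Filename></Image_Properties>
 <Marker_Data>
  <Marker_Type>
   <Type>1</Type>
   <Marker><MarkerX>7172</MarkerX><MarkerY>4332</MarkerY><MarkerZ>1</MarkerZ></Marker>
  </Marker_Type>
  <Marker_Type>
   <Type>2</Type>
   <Marker><MarkerX>6172</MarkerX><MarkerY>4332</MarkerY><MarkerZ>1</MarkerZ></Marker>
  </Marker_Type>
 </Marker_Data>
</CellCounter_Marker_File>'
doc2 <- xmlParse(xml_txt, asText = TRUE)

# Look the filename up once, before looping over the markers
fname <- xpathSApply(doc2, "//Image_Properties/Image_Filename", xmlValue)[[1]]

res2 <- xpathApply(doc2, "//Marker", function(.x) {
  tmp <- xmlToList(.x)                                  # MarkerX, MarkerY, MarkerZ
  tmp$Type <- xmlValue(getNodeSet(.x, "../Type")[[1]])  # sibling <Type> via the parent
  tmp$Filename <- fname                                 # reuse the hoisted value
  names(tmp) <- gsub("Marker([[:alpha:]])", "\\1axis", names(tmp))
  tmp
})
```

The result has the same shape as res, so the rbind and type.convert steps apply to it unchanged.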

You got lucky with this XML file, as the target nodes have consistent children, so the per-marker results bind together directly. It's probably more than luck, since this looks like a highly structured file, but we'd have more work to do after the xpathApply() call if the child nodes of <Marker> could vary:

res <- do.call(rbind.data.frame, res) 
res <- type.convert(res, as.is = TRUE)
res <- res[,c("Filename", "Type", "Xaxis", "Yaxis", "Zaxis")]
rownames(res) <- NULL

options("stringsAsFactors" = saf)

res
##   Filename Type Xaxis Yaxis Zaxis
## 1 File.tif    1  7172  4332     1
## 2 File.tif    1  7170  4140     1
## 3 File.tif    2  6172  4332     1
## 4 File.tif    2  5173  4140     1

str(res)
## 'data.frame': 4 obs. of  5 variables:
##  $ Filename: chr  "File.tif" "File.tif" "File.tif" "File.tif"
##  $ Type    : int  1 1 2 2
##  $ Xaxis   : int  7172 7170 6172 5173
##  $ Yaxis   : int  4332 4140 4332 4140
##  $ Zaxis   : int  1 1 1 1
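If the children of <Marker> could vary, one way to do the extra work mentioned above is to ask for a fixed set of child names and fill in NA where one is absent. This is only a sketch under that assumption. The names xml_txt, doc3, wanted, rows and res3 are hypothetical, and the second marker in the inline XML deliberately lacks a <MarkerZ>.

```r
library(XML)

# Inline example (assumption): the second marker has no <MarkerZ>
xml_txt <- '<Marker_Data>
 <Marker_Type>
  <Type>1</Type>
  <Marker><MarkerX>7172</MarkerX><MarkerY>4332</MarkerY><MarkerZ>1</MarkerZ></Marker>
  <Marker><MarkerX>7170</MarkerX><MarkerY>4140</MarkerY></Marker>
 </Marker_Type>
</Marker_Data>'
doc3 <- xmlParse(xml_txt, asText = TRUE)

# The child nodes we expect; anything missing becomes NA
wanted <- c("MarkerX", "MarkerY", "MarkerZ")

rows <- xpathApply(doc3, "//Marker", function(.x) {
  vals <- vapply(wanted, function(nm) {
    hit <- getNodeSet(.x, nm)  # relative XPath: direct child named nm
    if (length(hit) == 0) NA_character_ else xmlValue(hit[[1]])
  }, character(1))
  # one-row data frame with a column per wanted name
  as.data.frame(as.list(vals), stringsAsFactors = FALSE)
})

res3 <- type.convert(do.call(rbind, rows), as.is = TRUE)
```

Because every row ends up with the same columns, rbind no longer depends on the markers being uniform.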

Upvotes: 1

Wimpel

Reputation: 27732

Since the Marker nodes are the ones that define your rows, you should use them as the basis for looping. Get the Marker nodes using

xml_find_all(doc, ".//Marker")

You then iterate over the Marker nodes and use XPath to find each one's relevant children (MarkerX, MarkerY and MarkerZ), as well as its sibling (Type) and the filename (Image_Filename), which is reached through its ancestor Marker_Data. Once found, xml_text() is used to extract their values.

code

library( xml2 )
library( tidyverse )

df <-
  #get all markers, since they define the rows
  xml_find_all(doc, ".//Marker") %>%
  #iterate over the markers, and get relevant data from child-, ancestor- and sibling-nodes
  map_df( function(x) {
    set_names( c( #get the filename, which is in Image_Properties, a preceding sibling of the marker's ancestor Marker_Data
                  xml_find_all( x, "./ancestor::Marker_Data/preceding-sibling::Image_Properties/Image_Filename") %>% xml_text(),
                  #get the type, which is in the preceding sibling 'Type' of the marker
                  xml_find_all( x, "./preceding-sibling::Type") %>% xml_text(),
                  #get the MarkersX, Y and Z of the marker 
                  xml_find_all( x, ".//MarkerX") %>% xml_text(),
                  xml_find_all( x, ".//MarkerY") %>% xml_text(),
                  xml_find_all( x, ".//MarkerZ") %>% xml_text() ), 
               #set the column names
               c( "File", "Type", "Xaxis", "Yaxis", "z") ) %>% 
      as.list() %>% #make list
      flatten_df() #flatten to a data.frame
  }) %>%
  type_convert() #let R convert the values for you

output

# # A tibble: 4 x 5
#   File      Type Xaxis Yaxis     z
#   <chr>    <int> <int> <int> <int>
# 1 File.tif     1  7172  4332     1
# 2 File.tif     1  7170  4140     1
# 3 File.tif     2  6172  4332     1
# 4 File.tif     2  5173  4140     1
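For comparison, here is a loop-free sketch of the same idea that uses only xml2 and base R. It is not the code above, just a variant that relies on xml_find_first() returning one result per input node. It assumes every Marker has at most one Type sibling and one of each coordinate node. The names doc4, markers and df2 are made up for the example, and the inline XML is a shortened version of the sample data.

```r
library(xml2)

# Shortened sample document (assumption: same layout as the question's file)
doc4 <- read_xml('<CellCounter_Marker_File>
 <Image_Properties><Image_Filename>File.tif</Image_Filename></Image_Properties>
 <Marker_Data>
  <Marker_Type>
   <Type>1</Type>
   <Marker><MarkerX>7172</MarkerX><MarkerY>4332</MarkerY><MarkerZ>1</MarkerZ></Marker>
   <Marker><MarkerX>7170</MarkerX><MarkerY>4140</MarkerY><MarkerZ>1</MarkerZ></Marker>
  </Marker_Type>
  <Marker_Type>
   <Type>2</Type>
   <Marker><MarkerX>6172</MarkerX><MarkerY>4332</MarkerY><MarkerZ>1</MarkerZ></Marker>
  </Marker_Type>
 </Marker_Data>
</CellCounter_Marker_File>')

# One node per output row
markers <- xml_find_all(doc4, ".//Marker")

df2 <- data.frame(
  # single value, recycled for every row
  Filename = xml_text(xml_find_first(doc4, ".//Image_Filename")),
  # xml_find_first() on a nodeset returns one result per marker
  Type  = xml_text(xml_find_first(markers, "./preceding-sibling::Type")),
  Xaxis = xml_text(xml_find_first(markers, "./MarkerX")),
  Yaxis = xml_text(xml_find_first(markers, "./MarkerY")),
  Zaxis = xml_text(xml_find_first(markers, "./MarkerZ")),
  stringsAsFactors = FALSE
)
df2 <- type.convert(df2, as.is = TRUE)  # character columns to integer where possible
```

A marker that lacks one of the nodes yields NA in that column instead of shifting the other values, which is the main reason to prefer xml_find_first() here.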

sample data

doc <- read_xml( '<?xml version="1.0" encoding="UTF-8"?>
<CellCounter_Marker_File>
<Image_Properties>
<Image_Filename>File.tif</Image_Filename>
</Image_Properties>
<Marker_Data>
<Marker_Type>
<Type>1</Type>
<Marker>
<MarkerX>7172</MarkerX>
<MarkerY>4332</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
<Marker>
<MarkerX>7170</MarkerX>
<MarkerY>4140</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
</Marker_Type>
<Marker_Type>
<Type>2</Type>
<Marker>
<MarkerX>6172</MarkerX>
<MarkerY>4332</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
<Marker>
<MarkerX>5173</MarkerX>
<MarkerY>4140</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
</Marker_Type>
</Marker_Data>
</CellCounter_Marker_File>' )

Upvotes: 1
