Reputation: 31
I want to convert the following XML document into a data frame so that I can access the coordinate information.
a3 <- xmlParse("test.xml")
I figured I would need some kind of apply function, but I'm still not certain how to handle XML nodes that are identified by other nodes rather than by their own attributes. The <Marker_Data> and <Marker> nodes use a subnode and a sister node (respectively) for their unique identification. Is this a straightforward problem? The examples I've seen on Stack Overflow have unique XML attributes at each level to identify the node, for example "Parsing XML with different number of subnode with same name in R" or "Extract second attribute of a xml node in R (XML package)".
I would like to make a dataframe like this:
| Filename | Type | Xaxis | Yaxis | Zaxis |
|----------|------|-------|-------|-------|
| File.tif | 1 | 7172 | 4332 | 1 |
| File.tif | 1 | 7170 | 4140 | 1 |
| File.tif | 2 | 6172 | 4332 | 1 |
| File.tif | 2 | 5173 | 4140 | 1 |
test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<CellCounter_Marker_File>
<Image_Properties>
<Image_Filename>File.tif</Image_Filename>
</Image_Properties>
<Marker_Data>
<Marker_Type>
<Type>1</Type>
<Marker>
<MarkerX>7172</MarkerX>
<MarkerY>4332</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
<Marker>
<MarkerX>7170</MarkerX>
<MarkerY>4140</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
</Marker_Type>
<Marker_Type>
<Type>2</Type>
<Marker>
<MarkerX>6172</MarkerX>
<MarkerY>4332</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
<Marker>
<MarkerX>5173</MarkerX>
<MarkerY>4140</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
</Marker_Type>
</Marker_Data>
</CellCounter_Marker_File>
Upvotes: 3
Views: 1179
Reputation: 78792
👍🏼 on a well-crafted second SO question!
Since you're using the XML package, we'll keep you in that universe with this answer:
library(XML)
doc <- xmlParse("~/Data/sample.xml")
xpathApply(doc, "//Marker", function(.x) {
  tmp <- xmlToList(.x) # quick way to get the kids into a native R structure
  tmp$Type <- xmlValue(getNodeSet(.x, "../Type")[[1]]) # go back one and get the type
  # go to the root and get the filename (which we can and should do outside the apply)
  xmlValue(
    getNodeSet(.x, "//CellCounter_Marker_File/Image_Properties/Image_Filename")[[1]]
  ) -> tmp$Filename
  # change the names to protect the innocent
  names(tmp) <- gsub("Marker([[:alpha:]])", "\\1axis", names(tmp))
  tmp
}) -> res
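saf <- getOption("stringsAsFactors") # remember the current setting; getOption() also works where default.stringsAsFactors() is unavailable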
options("stringsAsFactors" = FALSE) # assuming you want strings vs factors
You got lucky with this XML file, as the target nodes have consistent kids. (It's likely more than luck, as this looks to be a highly structured file, but we'd have to do more work after the xpathApply() call if there were a chance of variation in the child nodes of <Marker>.) With consistent kids, the rows can be bound directly:
res <- do.call(rbind.data.frame, res)
res <- type.convert(res, as.is = TRUE)
res <- res[,c("Filename", "Type", "Xaxis", "Yaxis", "Zaxis")]
rownames(res) <- NULL
options("stringsAsFactors" = saf)
res
## Filename Type Xaxis Yaxis Zaxis
## 1 File.tif 1 7172 4332 1
## 2 File.tif 1 7170 4140 1
## 3 File.tif 2 6172 4332 1
## 4 File.tif 2 5173 4140 1
str(res)
## 'data.frame': 4 obs. of 5 variables:
## $ Filename: chr "File.tif" "File.tif" "File.tif" "File.tif"
## $ Type : int 1 1 2 2
## $ Xaxis : int 7172 7170 6172 5173
## $ Yaxis : int 4332 4140 4332 4140
## $ Zaxis : int 1 1 1 1
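If the children of <Marker> could vary, one possible approach is to look up each field by name and fall back to NA when it is missing. The following is only a sketch under assumptions: the helper `first_value()` is a hypothetical name, and the inline XML is a variant of the sample file in which the second <Marker> deliberately lacks <MarkerZ>. It also moves the filename lookup outside the loop and passes `stringsAsFactors = FALSE` directly, so no global option needs to be changed.

```r
library(XML)

# Variant of the sample file: the second <Marker> has no <MarkerZ>
# (a hypothetical irregularity, not present in the original data)
xml_txt <- '<CellCounter_Marker_File>
  <Image_Properties><Image_Filename>File.tif</Image_Filename></Image_Properties>
  <Marker_Data>
    <Marker_Type>
      <Type>1</Type>
      <Marker><MarkerX>7172</MarkerX><MarkerY>4332</MarkerY><MarkerZ>1</MarkerZ></Marker>
      <Marker><MarkerX>7170</MarkerX><MarkerY>4140</MarkerY></Marker>
    </Marker_Type>
    <Marker_Type>
      <Type>2</Type>
      <Marker><MarkerX>6172</MarkerX><MarkerY>4332</MarkerY><MarkerZ>1</MarkerZ></Marker>
      <Marker><MarkerX>5173</MarkerX><MarkerY>4140</MarkerY><MarkerZ>1</MarkerZ></Marker>
    </Marker_Type>
  </Marker_Data>
</CellCounter_Marker_File>'
doc <- xmlParse(xml_txt, asText = TRUE)

# The filename is the same for every row, so look it up once
filename <- xmlValue(getNodeSet(doc, "//Image_Properties/Image_Filename")[[1]])

# Hypothetical helper: text of the first node matching `path`
# relative to `node`, or NA when nothing matches
first_value <- function(node, path) {
  hit <- getNodeSet(node, path)
  if (length(hit) == 0) NA_character_ else xmlValue(hit[[1]])
}

markers <- getNodeSet(doc, "//Marker")

res <- data.frame(
  Filename = filename,
  Type     = vapply(markers, first_value, character(1), path = "../Type"),
  Xaxis    = vapply(markers, first_value, character(1), path = "MarkerX"),
  Yaxis    = vapply(markers, first_value, character(1), path = "MarkerY"),
  Zaxis    = vapply(markers, first_value, character(1), path = "MarkerZ"),
  stringsAsFactors = FALSE
)
res <- type.convert(res, as.is = TRUE) # character columns become integers where possible
res
```

The marker with the missing child simply gets NA in the Zaxis column, so every row keeps the same set of columns.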
Upvotes: 1
Reputation: 27732
Since the Marker nodes are the ones that define your rows, you should use them as the basis for looping. Get the Marker nodes using
xml_find_all(doc, ".//Marker")
You then iterate over the Marker nodes and use XPath to find each marker's relevant children (MarkerX, MarkerY and MarkerZ), its sibling (Type) and, by way of its Marker_Data ancestor, the Image_Filename node. Once found, xml_text() is used to extract their values.
code
library( xml2 )
library( tidyverse )
df <-
  # get all markers, since they define the rows (`doc` is created under "sample data" below)
  xml_find_all( doc, ".//Marker" ) %>%
  # iterate over the markers, and get relevant data from child, sibling and ancestor nodes
  map_df( function(x) {
    set_names(
      c(
        # get the filename: go up to Marker_Data, then across to its preceding sibling Image_Properties
        xml_find_all( x, "./ancestor::Marker_Data/preceding-sibling::Image_Properties/Image_Filename" ) %>% xml_text(),
        # get the type, which is in the preceding sibling 'Type' of the marker
        xml_find_all( x, "./preceding-sibling::Type" ) %>% xml_text(),
        # get MarkerX, MarkerY and MarkerZ of the marker
        xml_find_all( x, ".//MarkerX" ) %>% xml_text(),
        xml_find_all( x, ".//MarkerY" ) %>% xml_text(),
        xml_find_all( x, ".//MarkerZ" ) %>% xml_text()
      ),
      # set the column names
      c( "File", "Type", "Xaxis", "Yaxis", "z" )
    ) %>%
      as.list() %>%  # make list
      flatten_df()   # flatten to a data.frame
  }) %>%
  type_convert() # let R convert the values for you
output
# # A tibble: 4 x 5
# File Type Xaxis Yaxis z
# <chr> <int> <int> <int> <int>
# 1 File.tif 1 7172 4332 1
# 2 File.tif 1 7170 4140 1
# 3 File.tif 2 6172 4332 1
# 4 File.tif 2 5173 4140 1
sample data
doc <- read_xml( '<?xml version="1.0" encoding="UTF-8"?>
<CellCounter_Marker_File>
<Image_Properties>
<Image_Filename>File.tif</Image_Filename>
</Image_Properties>
<Marker_Data>
<Marker_Type>
<Type>1</Type>
<Marker>
<MarkerX>7172</MarkerX>
<MarkerY>4332</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
<Marker>
<MarkerX>7170</MarkerX>
<MarkerY>4140</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
</Marker_Type>
<Marker_Type>
<Type>2</Type>
<Marker>
<MarkerX>6172</MarkerX>
<MarkerY>4332</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
<Marker>
<MarkerX>5173</MarkerX>
<MarkerY>4140</MarkerY>
<MarkerZ>1</MarkerZ>
</Marker>
</Marker_Type>
</Marker_Data>
</CellCounter_Marker_File>' )
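As a possible alternative, the same XPath lookups can be applied to the whole set of Marker nodes at once, using only xml2 and base R. This is a sketch rather than the method used above: it relies on `xml_find_first()` returning one result per input node, and it names the columns after the table requested in the question. The XML is repeated inline (in compact form) so the block runs on its own.

```r
library(xml2)

# Compact inline copy of the sample data so the sketch is self-contained
doc <- read_xml('<CellCounter_Marker_File>
  <Image_Properties><Image_Filename>File.tif</Image_Filename></Image_Properties>
  <Marker_Data>
    <Marker_Type>
      <Type>1</Type>
      <Marker><MarkerX>7172</MarkerX><MarkerY>4332</MarkerY><MarkerZ>1</MarkerZ></Marker>
      <Marker><MarkerX>7170</MarkerX><MarkerY>4140</MarkerY><MarkerZ>1</MarkerZ></Marker>
    </Marker_Type>
    <Marker_Type>
      <Type>2</Type>
      <Marker><MarkerX>6172</MarkerX><MarkerY>4332</MarkerY><MarkerZ>1</MarkerZ></Marker>
      <Marker><MarkerX>5173</MarkerX><MarkerY>4140</MarkerY><MarkerZ>1</MarkerZ></Marker>
    </Marker_Type>
  </Marker_Data>
</CellCounter_Marker_File>')

# One row per <Marker>
markers <- xml_find_all(doc, ".//Marker")

df <- data.frame(
  # single value for the whole file, recycled across rows
  Filename = xml_text(xml_find_first(doc, ".//Image_Filename")),
  # xml_find_first() on a nodeset yields one match (or a missing node) per marker
  Type     = xml_text(xml_find_first(markers, "./preceding-sibling::Type")),
  Xaxis    = xml_text(xml_find_first(markers, "./MarkerX")),
  Yaxis    = xml_text(xml_find_first(markers, "./MarkerY")),
  Zaxis    = xml_text(xml_find_first(markers, "./MarkerZ")),
  stringsAsFactors = FALSE
)
df <- type.convert(df, as.is = TRUE) # convert the text columns to integers where possible
df
```

Because each lookup returns a vector aligned with `markers`, no explicit loop over the nodes is needed.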
Upvotes: 1