Reputation: 483
I am trying to parse the nodes and attributes of an XML file. Within the file there is a set of nodes with attributes. The nested XML structure is similar to a data frame, so I want to parse it into a data frame.
Here is an example file:
<?xml version="1.0" encoding="UTF-8"?>
<TrackMate version="3.8.0">
<Model spatialunits="µm" timeunits="sec">
<AllTracks>
<Track name="Track_2" TRACK_ID="2" NUMBER_SPOTS="140" NUMBER_GAPS="0" >
<Edge SPOT_SOURCE_ID="960769" SPOT_TARGET_ID="960778" LINK_COST="0.08756957830926632" />
<Edge SPOT_SOURCE_ID="958304" SPOT_TARGET_ID="958308" LINK_COST="1.4003359672950089" />
<Edge SPOT_SOURCE_ID="958316" SPOT_TARGET_ID="958322" LINK_COST="1.6985623204008202" />
</Track>
<Track name="Track_145" TRACK_ID="145" NUMBER_SPOTS="141" NUMBER_GAPS="0" >
<Edge SPOT_SOURCE_ID="961623" SPOT_TARGET_ID="961628" LINK_COST="2.2678642015413755" />
<Edge SPOT_SOURCE_ID="962122" SPOT_TARGET_ID="962127" LINK_COST="38.20777704254654" />
<Edge SPOT_SOURCE_ID="961869" SPOT_TARGET_ID="961873" LINK_COST="0.2895609647324684" />
</Track>
</AllTracks>
</Model>
</TrackMate>
I would like to create a data frame with all attributes of the edges plus the parent's TRACK_ID attribute. I can readily create a data frame with all the edges' attributes like this:
edges = data.frame(t(data.frame(xml_attrs(xml_find_all(xmlDoc, xpath = paste0('/TrackMate/Model/AllTracks//Edge'))))))
row.names(edges) = NULL
But then the corresponding track ID is lost. I can solve this with a for loop, but that is often not the "R way". Is there a simpler solution (e.g. with an XPath query)?
So the final desired output would be this data frame:
Edit: this comes closer, but then the Track nodes and Edge nodes are mixed within a single list.
xml_find_all(xmlDoc, xpath = paste0('/TrackMate/Model/AllTracks//Edge | /TrackMate/Model/AllTracks/Track'))
Upvotes: 2
Views: 1061
Reputation: 27732
The 'trick' is to get a list of all the Edge nodes and work with XPath
from there. You can select the Track node for each Edge node using the XPath
ancestor axis.
libraries used
#load libraries
library( xml2 )
library( magrittr )
sample data
doc <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<TrackMate version="3.8.0">
<Model spatialunits="µm" timeunits="sec">
<AllTracks>
<Track name="Track_2" TRACK_ID="2" NUMBER_SPOTS="140" NUMBER_GAPS="0" >
<Edge SPOT_SOURCE_ID="960769" SPOT_TARGET_ID="960778" LINK_COST="0.08756957830926632" />
<Edge SPOT_SOURCE_ID="958304" SPOT_TARGET_ID="958308" LINK_COST="1.4003359672950089" />
<Edge SPOT_SOURCE_ID="958316" SPOT_TARGET_ID="958322" LINK_COST="1.6985623204008202" />
</Track>
<Track name="Track_145" TRACK_ID="145" NUMBER_SPOTS="141" NUMBER_GAPS="0" >
<Edge SPOT_SOURCE_ID="961623" SPOT_TARGET_ID="961628" LINK_COST="2.2678642015413755" />
<Edge SPOT_SOURCE_ID="962122" SPOT_TARGET_ID="962127" LINK_COST="38.20777704254654" />
<Edge SPOT_SOURCE_ID="961869" SPOT_TARGET_ID="961873" LINK_COST="0.2895609647324684" />
</Track>
</AllTracks>
</Model>
</TrackMate>')
code
#find all edge nodes
edge.nodes <- xml_find_all( doc, ".//Edge")
#build the data.frame
data.frame( TRACK_ID = xml_find_first( edge.nodes, "ancestor::Track") %>% xml_attr("TRACK_ID"),
SPOT_SOURCE_ID = edge.nodes %>% xml_attr("SPOT_SOURCE_ID"),
SPOT_TARGET_ID = edge.nodes %>% xml_attr("SPOT_TARGET_ID"),
LINK_COST = edge.nodes %>% xml_attr("LINK_COST") )
output
# TRACK_ID SPOT_SOURCE_ID SPOT_TARGET_ID LINK_COST
# 1 2 960769 960778 0.08756957830926632
# 2 2 958304 958308 1.4003359672950089
# 3 2 958316 958322 1.6985623204008202
# 4 145 961623 961628 2.2678642015413755
# 5 145 962122 962127 38.20777704254654
# 6 145 961869 961873 0.2895609647324684
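If the Edge nodes carry more attributes than the three listed here, naming each column by hand gets tedious. The following is a minimal sketch of a generic variant, not part of the approach above: it collects every Edge attribute with xml_attrs(), adds the parent's TRACK_ID, and converts the character columns to numbers. It assumes that every Edge has the same attributes in the same order; the XML is a trimmed copy of the sample data.

```r
library(xml2)

# Trimmed copy of the sample data (fewer edges, Model attributes left out)
doc <- read_xml('<TrackMate version="3.8.0">
<Model>
<AllTracks>
<Track name="Track_2" TRACK_ID="2" NUMBER_SPOTS="140" NUMBER_GAPS="0" >
<Edge SPOT_SOURCE_ID="960769" SPOT_TARGET_ID="960778" LINK_COST="0.08756957830926632" />
<Edge SPOT_SOURCE_ID="958304" SPOT_TARGET_ID="958308" LINK_COST="1.4003359672950089" />
</Track>
<Track name="Track_145" TRACK_ID="145" NUMBER_SPOTS="141" NUMBER_GAPS="0" >
<Edge SPOT_SOURCE_ID="961623" SPOT_TARGET_ID="961628" LINK_COST="2.2678642015413755" />
</Track>
</AllTracks>
</Model>
</TrackMate>')

# find all Edge nodes
edge.nodes <- xml_find_all(doc, ".//Edge")

# xml_attrs() returns one named character vector per Edge node;
# rbind stacks them into a character matrix, one row per Edge
# (assumption: identical attribute names and order on every Edge)
edges <- as.data.frame(do.call(rbind, xml_attrs(edge.nodes)),
                       stringsAsFactors = FALSE)

# add the TRACK_ID of the enclosing Track node
edges$TRACK_ID <- xml_attr(xml_find_first(edge.nodes, "ancestor::Track"),
                           "TRACK_ID")

# attributes are read as character; convert each column to a suitable type
edges[] <- lapply(edges, type.convert, as.is = TRUE)

str(edges)
```

If some Edge nodes are missing attributes, rbind would recycle or misalign values, so in that case the explicit xml_attr() calls per column are the safer choice, since xml_attr() returns NA for a missing attribute.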
Upvotes: 5