Reputation: 103
I am trying to parse some XML documents in R into a data frame. What I want to do is flatten the XML tree so that I get one row in the data frame per child node (each frame). I also want each row to contain the data from its parent (the event).
Example:
<xml>
<eventlist>
<event>
<ProcessIndex>1063</ProcessIndex>
<Time_of_Day>2:54:20.2959537 PM</Time_of_Day>
<Process_Name>chrome.exe</Process_Name>
<PID>12164</PID>
<Operation>ReadFile</Operation>
<Result>SUCCESS</Result>
<Detail>Offset: 1,684,224, Length: 256</Detail>
<stack>
<frame>
<depth>0</depth>
<address>0xfffff8038683667c</address>
<path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
<location>FltDecodeParameters + 0x1a6c</location>
</frame>
<frame>
<depth>1</depth>
<address>0xfffff80386834e13</address>
<path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
<location>FltDecodeParameters + 0x203</location>
</frame>
<frame>
<depth>3</depth>
<address>0x7ffea54ffac1</address>
<path>C:\WINDOWS\SYSTEM32\ntdll.dll</path>
<location>RtlUserThreadStart + 0x21</location>
</frame>
</stack>
</event>
<event>
<ProcessIndex>1063</ProcessIndex>
<Time_of_Day>2:54:20.2960270 PM</Time_of_Day>
<Process_Name>chrome.exe</Process_Name>
<PID>12164</PID>
<Operation>WriteFile</Operation>
<Result>SUCCESS</Result>
<Detail>Offset: 103,016, Length: 36</Detail>
<stack>
<frame>
<depth>0</depth>
<address>0xfffff8038683667c</address>
<path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
<location>FltDecodeParameters + 0x1a6c</location>
</frame>
<frame>
<depth>1</depth>
<address>0xfffff80386834e13</address>
<path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
<location>FltDecodeParameters + 0x203</location>
</frame>
<frame>
<depth>26</depth>
<address>0x7ffea54ffac1</address>
<path>C:\WINDOWS\SYSTEM32\ntdll.dll</path>
<location>RtlUserThreadStart + 0x21</location>
</frame>
</stack>
</event>
</eventlist>
</xml>
The result that I would like to get is:
ProcessIndex Time_of_Day Process_Name PID Operation Result depth address path location
1063 2:54:20 chrome.exe 12164 ReadFile SUCCESS 0 0xfffff.. C:\WINDOWS\System32\driv... FltDecodeParameters + 0x1a6c
1063 2:54:20 chrome.exe 12164 ReadFile SUCCESS 1 0xfffff.. C:\WINDOWS\System32\driv... FltDecodeParameters + 0x203
1063 2:54:20 chrome.exe 12164 ReadFile SUCCESS 2 0xfffff.. C:\WINDOWS\System32\driv... RtlUserThreadStart + 0x21
1063 2:54:20 chrome.exe 12164 WriteFile SUCCESS 0 0xfffff.. C:\WINDOWS\System32\driv... FltDecodeParameters + 0x1a6c
1063 2:54:20 chrome.exe 12164 WriteFile SUCCESS 1 0xfffff.. C:\WINDOWS\System32\driv... FltDecodeParameters + 0x203
1063 2:54:20 chrome.exe 12164 WriteFile SUCCESS 2 0xfffff.. C:\WINDOWS\System32\driv... RtlUserThreadStart + 0x21
I tried using the XML package and xmlToDataFrame:
xmldf_events_stack <- xmlToDataFrame(nodes=getNodeSet(data_xml_2,"//eventlist/event/stack/frame"))
but that only gives me the flattened frames without the parent data. Also, if I try to parse the event data into a data frame, all XML tags are stripped from the nested frame data, so there is no way for me to parse it later.
Any help or guidance in the right direction will be appreciated.
Upvotes: 2
Views: 1054
Reputation: 107642
Consider parsing by node index, [##], and then merging the parent with its children inside an lapply call, which yields a list of data frames to be row-bound together:
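library(XML)

doc <- xmlParse("/path/to/XML/file.xml")
xml_len <- length(getNodeSet(doc, "//eventlist/event"))

dflist <- lapply(seq_len(xml_len), function(i) {
  # PARENT NODE: one row of event-level fields (the nested <stack> text ends up collapsed in one column)
  d1 <- transform(xmlToDataFrame(nodes = getNodeSet(doc, paste0("//eventlist/event[", i, "]"))), key = 1)
  # CHILD NODES: one row per <frame> of this event
  d2 <- transform(xmlToDataFrame(nodes = getNodeSet(doc, paste0("//eventlist/event[", i, "]/stack/frame"))), key = 1)
  # MERGE ON KEY (pairs the single parent row with every frame row), THEN DROP KEY
  merge(d1, d2, by = "key")[-1]
})

# Row-bind the per-event data frames into one
xmldf_events_stack <- do.call(rbind, dflist)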
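The same parent/child pairing can also be sketched without a merge key, by starting from each frame and walking up to its event with a relative XPath. This is only a minimal sketch on a shortened inline sample shaped like the question's XML; the helper name fields and the variable names are hypothetical, and it assumes every field element contains text.

```r
library(XML)

# Shortened inline sample with the same nesting as the question's XML
xml_text <- '<xml><eventlist>
  <event>
    <PID>12164</PID><Operation>ReadFile</Operation>
    <stack>
      <frame><depth>0</depth><location>FltDecodeParameters + 0x1a6c</location></frame>
      <frame><depth>1</depth><location>FltDecodeParameters + 0x203</location></frame>
    </stack>
  </event>
  <event>
    <PID>12164</PID><Operation>WriteFile</Operation>
    <stack>
      <frame><depth>0</depth><location>FltDecodeParameters + 0x1a6c</location></frame>
    </stack>
  </event>
</eventlist></xml>'
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: named character vector of the elements matched by `path`,
# evaluated relative to `node`
fields <- function(node, path) {
  kids <- getNodeSet(node, path)
  setNames(sapply(kids, xmlValue), sapply(kids, xmlName))
}

frames <- getNodeSet(doc, "//eventlist/event/stack/frame")
rows <- lapply(frames, function(f) {
  # "../../*[not(self::stack)]" selects the event-level fields of this frame's event;
  # "./*" selects the frame's own fields
  c(fields(f, "../../*[not(self::stack)]"), fields(f, "./*"))
})

# One row per frame, parent columns first
flat <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(flat)
```

All columns come back as character, so numeric fields such as PID or depth would still need converting afterwards.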
Upvotes: 0
Reputation: 103
I solved the problem. I am sure there is a more elegant way to do this, but this is what I did. I hope it helps somebody in the future.
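library(XML)   # xpathSApply, xmlName, xmlValue
library(plyr)  # rbind.fill

# data_xml_2 is the already parsed XML document from the question
df <- do.call(rbind.fill, lapply(data_xml_2['//eventlist/event'], function(x) {
  # Every text node is preceded by the element that contains it,
  # so the entry before each "text" is the field name
  names <- xpathSApply(x, './/.', xmlName)
  names <- names[which(names == "text") - 1]
  values <- xpathSApply(x, ".//text()", xmlValue)
  # The first 7 values are event-level fields; the rest are frame fields, 4 per frame
  framevalues <- values[8:length(values)]
  framevalues <- matrix(framevalues, ncol = 4, byrow = TRUE)
  retvalues <- framevalues
  # Prepend the 7 event-level values to every frame row
  for (i in 7:1) {
    retvalues <- cbind(values[i], retvalues)
  }
  # 7 event columns + 4 frame columns = 11 column names
  colnames(retvalues) <- names[1:11]
  return(as.data.frame(retvalues))
}))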
Upvotes: 4