MrTommek

Reputation: 103

R parsing XML tree with hierarchical data to dataframe

I am trying to parse some XML documents in R (XML --> data frame). What I want to do is flatten the XML tree so that I get one row in the data frame per child. I also want each row to contain the data from its parent.

example:

<xml>
    <eventlist>
        <event>
            <ProcessIndex>1063</ProcessIndex>
            <Time_of_Day>2:54:20.2959537 PM</Time_of_Day>
            <Process_Name>chrome.exe</Process_Name>
            <PID>12164</PID>
            <Operation>ReadFile</Operation>
            <Result>SUCCESS</Result>
            <Detail>Offset: 1,684,224, Length: 256</Detail>
            <stack>
                <frame>
                    <depth>0</depth>
                    <address>0xfffff8038683667c</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x1a6c</location>
                </frame>
                <frame>
                    <depth>1</depth>
                    <address>0xfffff80386834e13</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x203</location>
                </frame>
                <frame>
                    <depth>3</depth>
                    <address>0x7ffea54ffac1</address>
                    <path>C:\WINDOWS\SYSTEM32\ntdll.dll</path>
                    <location>RtlUserThreadStart + 0x21</location>
                </frame>
            </stack>
        </event>
        <event>
            <ProcessIndex>1063</ProcessIndex>
            <Time_of_Day>2:54:20.2960270 PM</Time_of_Day>
            <Process_Name>chrome.exe</Process_Name>
            <PID>12164</PID>
            <Operation>WriteFile</Operation>
            <Result>SUCCESS</Result>
            <Detail>Offset: 103,016, Length: 36</Detail>
            <stack>
                <frame>
                    <depth>0</depth>
                    <address>0xfffff8038683667c</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x1a6c</location>
                </frame>
                <frame>
                    <depth>1</depth>
                    <address>0xfffff80386834e13</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x203</location>
                </frame>
                <frame>
                    <depth>26</depth>
                    <address>0x7ffea54ffac1</address>
                    <path>C:\WINDOWS\SYSTEM32\ntdll.dll</path>
                    <location>RtlUserThreadStart + 0x21</location>
                </frame>
            </stack>
        </event>
    </eventlist>
</xml>

And the result that I would like to get is:

ProcessIndex  Time_of_Day  Process_Name  PID    Operation  Result   depth  address    path                         location
1063          2:54:20      chrome.exe    12164  ReadFile   SUCCESS  0      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x1a6c
1063          2:54:20      chrome.exe    12164  ReadFile   SUCCESS  1      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x203
1063          2:54:20      chrome.exe    12164  ReadFile   SUCCESS  2      0xfffff..  C:\WINDOWS\System32\driv...  RtlUserThreadStart + 0x21
1063          2:54:20      chrome.exe    12164  WriteFile  SUCCESS  0      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x1a6c
1063          2:54:20      chrome.exe    12164  WriteFile  SUCCESS  1      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x203
1063          2:54:20      chrome.exe    12164  WriteFile  SUCCESS  2      0xfffff..  C:\WINDOWS\System32\driv...  RtlUserThreadStart + 0x21

I tried using the XML package and xmlToDataFrame:

xmldf_events_stack <- xmlToDataFrame(nodes=getNodeSet(data_xml_2,"//eventlist/event/stack/frame"))

but that only gives me the flattened frames without the parent data. Also, if I try to parse the event data into a data frame, all XML tags are removed from the frame field, so there is no way for me to parse it later.

Any help or guidance in the right direction will be appreciated.

Upvotes: 2

Views: 1054

Answers (2)

Parfait

Reputation: 107642

Consider parsing by node index, [#], and then merging each parent with its children inside an lapply call, which yields a list of data frames to be row-bound together:

library(XML)

doc <- xmlParse("/path/to/XML/file.xml")

xml_len <- length(getNodeSet(doc,"//eventlist/event"))

dflist <- lapply(seq(xml_len), function(i){   
  # PARENT NODES   
  d1 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//eventlist/event[",i,"]"))), key=1)
  # CHILD NODES
  d2 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//eventlist/event[",i,"]/stack/frame"))), key=1) 

  # MERGE ON KEY (DROPPING THE CONCATENATED stack TEXT COLUMN FROM THE PARENT), THEN DROP KEY
  merge(d1[names(d1) != "stack"], d2, by="key")[-1]
})

xmldf_events_stack <- do.call(rbind, dflist)
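
As a self-contained illustration of the same idea (pair every frame with the fields of its parent event), here is a minimal sketch that runs on an inline string instead of a file. The sample XML is a shortened, hypothetical version of the question's data, and `fields_of` is a helper name introduced only for this example. It assumes every frame has the same child elements, since plain rbind needs matching columns.

```r
library(XML)

# Hypothetical, shortened sample modelled on the question's XML
xml_text <- '<xml><eventlist>
  <event>
    <PID>12164</PID><Operation>ReadFile</Operation>
    <stack>
      <frame><depth>0</depth><location>A</location></frame>
      <frame><depth>1</depth><location>B</location></frame>
    </stack>
  </event>
  <event>
    <PID>12164</PID><Operation>WriteFile</Operation>
    <stack>
      <frame><depth>0</depth><location>C</location></frame>
    </stack>
  </event>
</eventlist></xml>'

doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: named character vector of the element nodes
# matched by a relative XPath (element name -> text value)
fields_of <- function(node, xpath) {
  kids <- getNodeSet(node, xpath)
  setNames(sapply(kids, xmlValue), sapply(kids, xmlName))
}

rows <- lapply(getNodeSet(doc, "//eventlist/event"), function(ev) {
  # Event-level fields: every child element except <stack>
  parent <- fields_of(ev, "./*[not(self::stack)]")
  # One single-row data frame per <frame>, with the parent fields prepended
  frame_rows <- lapply(getNodeSet(ev, "./stack/frame"), function(fr) {
    as.data.frame(as.list(c(parent, fields_of(fr, "./*"))),
                  stringsAsFactors = FALSE)
  })
  do.call(rbind, frame_rows)
})

result <- do.call(rbind, rows)
rownames(result) <- NULL
print(result)
```

All columns come back as character, so numeric fields such as depth would still need an explicit as.integer() if you want to compute on them.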

Upvotes: 0

MrTommek

Reputation: 103

I solved the problem. I am sure there is a more elegant way to do this, but this is what I did. I hope it helps somebody in the future.

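library(XML)
library(plyr)  # provides rbind.fill

df <- do.call(rbind.fill, lapply(data_xml_2['//eventlist/event'], function(x) { 
  # Name of the element that owns each text node
  names <- xpathSApply(x, './/.', xmlName) 
  names <- names[which(names == "text") - 1]
  # All text values in document order: 7 event fields, then 4 fields per frame
  values <- xpathSApply(x, ".//text()", xmlValue)
  framevalues <- values[8:length(values)]
  framevalues <- matrix(framevalues, ncol = 4, byrow = TRUE)

  # Prepend the 7 event fields to every frame row
  retvalues <- framevalues
  for(i in 7:1){
    retvalues <- cbind(values[i],retvalues)
  }
  # 7 event columns + 4 frame columns
  colnames(retvalues) <- names[seq_len(ncol(retvalues))]
  return(as.data.frame(retvalues))
}))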
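
The reshaping step in the code above can be tried without any XML at all. The sketch below uses a hypothetical, hand-written vector of text values for one event (7 event fields followed by 2 frames of 4 fields each). The names `n_event` and `n_frame` are introduced here for illustration, and the column names are taken from the example in the question.

```r
# Hypothetical flattened text values for one event, in document order
values <- c("1063", "2:54:20 PM", "chrome.exe", "12164", "ReadFile", "SUCCESS", "Offset: 0",
            "0", "0xaaa", "a.sys", "f + 0x1",
            "1", "0xbbb", "b.sys", "g + 0x2")

n_event <- 7  # assumed number of event-level fields
n_frame <- 4  # assumed number of fields per frame

# One matrix row per frame
frame_values <- matrix(values[(n_event + 1):length(values)],
                       ncol = n_frame, byrow = TRUE)

# Repeat each event-level field once per frame row (filled column by column)
event_values <- matrix(rep(values[seq_len(n_event)], each = nrow(frame_values)),
                       ncol = n_event)

flat <- cbind(event_values, frame_values)
colnames(flat) <- c("ProcessIndex", "Time_of_Day", "Process_Name", "PID",
                    "Operation", "Result", "Detail",
                    "depth", "address", "path", "location")
flat <- as.data.frame(flat, stringsAsFactors = FALSE)
print(flat)
```

This only works while every event has exactly the assumed number of fields. If an element is missing or empty, the positions shift and the columns no longer line up.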

Upvotes: 4
