Reputation: 2206
I'm trying to parse information from a XML file that contains a lot of elements with repeating names.
Here is an example of the type of file I am trying to parse, containing only one record:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<array>
<!--
Start of the FIRST record.
-->
<dict>
<key>80211D_IE</key>
<dict>
<key>IE_KEY_80211D_CHAN_INFO_ARRAY</key>
<array>
<dict>
<key>IE_KEY_80211D_FIRST_CHANNEL</key>
<integer>1</integer>
<key>IE_KEY_80211D_MAX_POWER</key>
<integer>27</integer>
<key>IE_KEY_80211D_NUM_CHANNELS</key>
<integer>11</integer>
</dict>
</array>
<key>IE_KEY_80211D_COUNTRY_CODE</key>
<string>US</string>
</dict>
<key>AGE</key>
<integer>0</integer>
<key>AP_MODE</key>
<integer>2</integer>
<key>BEACON_INT</key>
<integer>100</integer>
<key>BSSID</key>
<string>ac:5d:10:73:c3:11</string>
<key>CAPABILITIES</key>
<integer>1073</integer>
<key>CHANNEL</key>
<integer>2</integer>
<key>CHANNEL_FLAGS</key>
<integer>10</integer>
<key>IE</key>
<data>
AAZPbGl2ZXIBCIKEiwwSlhgkAwECBwZVUyABCxswGAEAAA+sAgIAAA+sBAAP
rAIBAAAPrAIAAN0aAFDyAQEAAFDyAgIAAFDyBABQ8gIBAABQ8gIqAQAyBDBI
YGw=
</data>
<key>NOISE</key>
<integer>0</integer>
<key>RATES</key>
<array>
<integer>1</integer>
<integer>2</integer>
<integer>5</integer>
<integer>6</integer>
<integer>9</integer>
<integer>11</integer>
<integer>12</integer>
<integer>18</integer>
<integer>24</integer>
<integer>36</integer>
<integer>48</integer>
<integer>54</integer>
</array>
<key>RSN_IE</key>
<dict>
<key>IE_KEY_RSN_AUTHSELS</key>
<array>
<integer>2</integer>
</array>
<key>IE_KEY_RSN_MCIPHER</key>
<integer>2</integer>
<key>IE_KEY_RSN_UCIPHERS</key>
<array>
<integer>4</integer>
<integer>2</integer>
</array>
<key>IE_KEY_RSN_VERSION</key>
<integer>1</integer>
</dict>
<key>RSSI</key>
<integer>-74</integer>
<key>SSID</key>
<data>
T2xpdmVy
</data>
<key>SSID_STR</key>
<string>Oliver</string>
<key>WPA_IE</key>
<dict>
<key>IE_KEY_WPA_AUTHSELS</key>
<array>
<integer>2</integer>
</array>
<key>IE_KEY_WPA_MCIPHER</key>
<integer>2</integer>
<key>IE_KEY_WPA_UCIPHERS</key>
<array>
<integer>4</integer>
<integer>2</integer>
</array>
<key>IE_KEY_WPA_VERSION</key>
<integer>1</integer>
</dict>
</dict>
<!--
End of the FIRST record.
In reality, more records follow.
-->
</array>
</plist>
The problem that I have is that I want to essentially flatten each observation (only one observation in the example above) so that each element within the first <array>
(i.e. each <dict>
within <array>
) is a row in a data frame and each element within that <dict>
is a column, named by the appropriate <key>
.
I have experimented with functions in the XML
package, mostly xmlToList
, but haven't quite been able to figure out the correct way of parsing XML data.
Edit:
My desired output is to more-or-less flatten each record to a row in a data frame or in a list so that values can be easily accessed by key. I don't necessarily care about keeping any hierarchical structure, for example every record will have <key>80211D_IE</key>
followed by a dict
that contains actual information -- the <key>80211D_IE</key>
isn't necessary because it doesn't contain any real information but is just an unnecessary grouping on a set of items. I'm fine either storing this as a list, something like mydata$record1$X80211D_IE$I.E._KEY_80211D_CHAN_INFO_ARRAY$IE_KEY_80211D_FIRST_CHANNEL
, or in a data frame like mydata[1, 'I.E._KEY_80211D_FIRST_CHANNEL']
.
The biggest problem I have right now is that this XML structure does not seem to lend itself to parsing very well. For example if I want to subset the XML to records where the SSID_STR
matches a string, I can't just use xmlToList
because it doesn't know that keys should associated with their values. So I get a list like this:
> str(xmlToList("path/to/my/file.xml"), max.level=2)
List of 2
$ array :List of 25
..$ dict:List of 36
..$ dict:List of 32
..$ dict:List of 32
..$ dict:List of 38
..$ dict:List of 36
..$ dict:List of 34
..$ dict:List of 34
..$ dict:List of 34
..$ dict:List of 34
..$ dict:List of 34
..$ dict:List of 32
..$ dict:List of 38
..$ dict:List of 38
..$ dict:List of 34
..$ dict:List of 36
..$ dict:List of 34
..$ dict:List of 36
..$ dict:List of 34
..$ dict:List of 36
..$ dict:List of 36
..$ dict:List of 40
..$ dict:List of 42
..$ dict:List of 36
..$ dict:List of 38
..$ dict:List of 38
$ .attrs: Named chr "1.0"
..- attr(*, "names")= chr "version"
and looking at a single one of these
> str(xmlToList("path/to/my/file.xml")$array[[1]], max.level = 1)
List of 36
$ key : chr "80211D_IE"
$ dict :List of 4
$ key : chr "AGE"
$ integer: chr "0"
$ key : chr "AP_MODE"
$ integer: chr "2"
$ key : chr "BEACON_INT"
$ integer: chr "100"
$ key : chr "BSSID"
$ string : chr "a:18:a:31:0:83"
$ key : chr "CAPABILITIES"
$ integer: chr "4145"
$ key : chr "CHANNEL"
$ integer: chr "11"
$ key : chr "CHANNEL_FLAGS"
$ integer: chr "10"
$ key : chr "HT_CAPS_IE"
$ dict :List of 12
$ key : chr "HT_IE"
$ dict :List of 34
$ key : chr "IE"
$ data : chr "\n\t\tAAR0ZXN0AQiWlgwSGCQwSAMBCwcGVVMgAQseKgEDMBgBAAAPrAICAAAPrAQA\n\t\tD6wCAQAAD6wCAAAyAmBsRgVzwAEAADMCDAstGowRG///AAAAAAAAAAA"| __truncated__
$ key : chr "NOISE"
$ integer: chr "0"
$ key : chr "RATES"
$ array :List of 9
$ key : chr "RSN_IE"
$ dict :List of 8
$ key : chr "RSSI"
$ integer: chr "-86"
$ key : chr "SSID"
$ data : chr "\n\t\tdGVzdA==\n\t\t"
$ key : chr "SSID_STR"
$ string : chr "test"
$ key : chr "WPA_IE"
$ dict :List of 8
it is easy to see that there are really only 18 items but the keys are stored as their own items (making 36).
The xmlToList
function actually does almost what I want it do to -- but rather to name the elements of the list containing the data using the value of the corresponding key.
This would look like:
List of 18
$ AGE : chr "0"
$ AP_MODE : chr "2"
$ BEACON_INT : chr "100"
$ BSSID : chr "a:18:a:31:0:83"
$ CAPABILITIES : chr "4145"
$ CHANNEL : chr "11"
$ CHANNEL_FLAGS: chr "10"
$ HT_CAPS_IE :List of 12
$ HT_IE :List of 34
$ IE : chr "\n\t\tAAR0ZXN0AQiWlgwSGCQwSAMBCwcGVVMgAQseKgEDMBgBAAAPrAICAAAPrAQA\n\t\tD6wCAQAAD6wCAAAyAmBsRgVzwAEAADMCDAstGowRG///AAAAAAAAAAA"| __truncated__
$ NOISE : chr "0"
$ RATES :List of 9
$ RSN_IE :List of 8
$ RSSI : chr "-86"
$ SSID : chr "\n\t\tdGVzdA==\n\t\t"
$ SSID_STR : chr "test"
$ WPA_IE :List of 8
$ X80211D_IE :List of 4
In this hypothetical output, it would be easy to get values using the appropriate key. Further it would be easy to continue to unnest lists (since the grouping structure is unnecessary) to produce a data frame.
Upvotes: 0
Views: 808
Reputation: 2601
New answer after the significant edit to the question.
I stored OP's XML in a file BUT DUPLICATED THE SINGLE RECORD PROVIDED!
I'm letting myself use %>%
now. I get 16 elements per record where OP
gets 18 because the actual XML posted contains no evidence of
HT_CAPS_IE
and HT_IE
. Given the way we're doing this now, it's more
about computation on lists than XML, which seems unavoidable. The link
between keys and data is more based on adjacency than structure.
library(magrittr)
library(xml2)
## ugly workaround: xml2 does not seem to ignore insignificant whitespace?
x <- "so.xml" %>%
scan(what = character(), sep = "\n", strip.white = TRUE) %>%
paste0(collapse = "") %>%
read_xml
## isolate each record
(records <- x %>%
xml_children() %>%
xml_children())
#> {xml_nodeset (2)}
#> [1] <dict>\n <key>80211D_IE</key>\n <dict>\n <key>IE_KEY_80211D_CHA ...
#> [2] <dict>\n <key>80211D_IE</key>\n <dict>\n <key>IE_KEY_80211D_CHA ...
## turn each record into a list
records_list <- records %>% lapply(as_list)
str(records_list, max.level = 1)
#> List of 2
#> $ :List of 32
#> $ :List of 32
## IRL here's where I check that ...
## we have key, THINGY, key, THINGY, etc. within each record
## we have THINGY1, THINGY2, etc. across all records
## store item names from record 1
keys <- records_list[[1]][c(TRUE, FALSE)] %>% unlist
## isolate the data, do obvious simplifications, apply item names
jfun <- function(x) if(is.list(x) && length(x) > 1) x else unlist(x)
z <- records_list %>%
lapply(`[`, c(FALSE, TRUE)) %>%
lapply(`names<-`, keys) %>%
lapply(lapply, jfun)
## done!
str(z[[1]], max.level = 1)
#> List of 16
#> $ 80211D_IE :List of 4
#> $ AGE : chr "0"
#> $ AP_MODE : chr "2"
#> $ BEACON_INT : chr "100"
#> $ BSSID : chr "ac:5d:10:73:c3:11"
#> $ CAPABILITIES : chr "1073"
#> $ CHANNEL : chr "2"
#> $ CHANNEL_FLAGS: chr "10"
#> $ IE : chr "AAZPbGl2ZXIBCIKEiwwSlhgkAwECBwZVUyABCxswGAEAAA+sAgIAAA+sBAAPrAIBAAAPrAIAAN0aAFDyAQEAAFDyAgIAAFDyBABQ8gIBAABQ8gIqAQAyBDBIYGw="
#> $ NOISE : chr "0"
#> $ RATES :List of 12
#> $ RSN_IE :List of 8
#> $ RSSI : chr "-74"
#> $ SSID : chr "T2xpdmVy"
#> $ SSID_STR : chr "Oliver"
#> $ WPA_IE :List of 8
Upvotes: 1
Reputation: 2601
I stored OP's XML in a file but duplicated the single record that was provided!
This could be slicker using some additional add-on packages (I would use dplyr
and the %>%
), but I held back. I do advise using xml2
instead of XML
. You can use XPATH expressions to target the nodes of interest.
x <- read_xml("so.xml")
(elements <- xml_find_all(x, ".//dict/dict/array/dict"))
#> {xml_nodeset (2)}
#> [1] <dict>\n <key>IE_KEY_80211D_FIRST_CHANNEL</key>\n ...
#> [2] <dict>\n <key>IE_KEY_80211D_FIRST_CHANNEL</key>\n ...
## isolate the key nodes ... will become variable names
keys <- lapply(elements, xml_find_all, "key")
keys <- lapply(keys, xml_text)
## I advise checking that keys are uniform across the records here!
(keys <- keys[[1]])
#> [1] "IE_KEY_80211D_FIRST_CHANNEL" "IE_KEY_80211D_MAX_POWER"
#> [3] "IE_KEY_80211D_NUM_CHANNELS"
## isolate integer data
integers <- lapply(y, xml_find_all, "integer")
integers <- lapply(integers, xml_text)
integers <- lapply(integers, type.convert)
yay <- as.data.frame(do.call(rbind, integers))
names(yay) <- keys
yay
#> IE_KEY_80211D_FIRST_CHANNEL IE_KEY_80211D_MAX_POWER
#> 1 1 27
#> 2 1 27
#> IE_KEY_80211D_NUM_CHANNELS
#> 1 11
#> 2 11
Upvotes: 3