Reputation: 2206

Parsing XML file with known structure and repeating elements

I'm trying to parse information from a XML file that contains a lot of elements with repeating names.

Here is an example of the type of file I am trying to parse, containing only one record:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<array>
    <!--
        Start of the FIRST record.
    -->
    <dict>
        <key>80211D_IE</key>
        <dict>
            <key>IE_KEY_80211D_CHAN_INFO_ARRAY</key>
            <array>
                <dict>
                    <key>IE_KEY_80211D_FIRST_CHANNEL</key>
                    <integer>1</integer>
                    <key>IE_KEY_80211D_MAX_POWER</key>
                    <integer>27</integer>
                    <key>IE_KEY_80211D_NUM_CHANNELS</key>
                    <integer>11</integer>
                </dict>
            </array>
            <key>IE_KEY_80211D_COUNTRY_CODE</key>
            <string>US</string>
        </dict>
        <key>AGE</key>
        <integer>0</integer>
        <key>AP_MODE</key>
        <integer>2</integer>
        <key>BEACON_INT</key>
        <integer>100</integer>
        <key>BSSID</key>
        <string>ac:5d:10:73:c3:11</string>
        <key>CAPABILITIES</key>
        <integer>1073</integer>
        <key>CHANNEL</key>
        <integer>2</integer>
        <key>CHANNEL_FLAGS</key>
        <integer>10</integer>
        <key>IE</key>
        <data>
        AAZPbGl2ZXIBCIKEiwwSlhgkAwECBwZVUyABCxswGAEAAA+sAgIAAA+sBAAP
        rAIBAAAPrAIAAN0aAFDyAQEAAFDyAgIAAFDyBABQ8gIBAABQ8gIqAQAyBDBI
        YGw=
        </data>
        <key>NOISE</key>
        <integer>0</integer>
        <key>RATES</key>
        <array>
            <integer>1</integer>
            <integer>2</integer>
            <integer>5</integer>
            <integer>6</integer>
            <integer>9</integer>
            <integer>11</integer>
            <integer>12</integer>
            <integer>18</integer>
            <integer>24</integer>
            <integer>36</integer>
            <integer>48</integer>
            <integer>54</integer>
        </array>
        <key>RSN_IE</key>
        <dict>
            <key>IE_KEY_RSN_AUTHSELS</key>
            <array>
                <integer>2</integer>
            </array>
            <key>IE_KEY_RSN_MCIPHER</key>
            <integer>2</integer>
            <key>IE_KEY_RSN_UCIPHERS</key>
            <array>
                <integer>4</integer>
                <integer>2</integer>
            </array>
            <key>IE_KEY_RSN_VERSION</key>
            <integer>1</integer>
        </dict>
        <key>RSSI</key>
        <integer>-74</integer>
        <key>SSID</key>
        <data>
        T2xpdmVy
        </data>
        <key>SSID_STR</key>
        <string>Oliver</string>
        <key>WPA_IE</key>
        <dict>
            <key>IE_KEY_WPA_AUTHSELS</key>
            <array>
                <integer>2</integer>
            </array>
            <key>IE_KEY_WPA_MCIPHER</key>
            <integer>2</integer>
            <key>IE_KEY_WPA_UCIPHERS</key>
            <array>
                <integer>4</integer>
                <integer>2</integer>
            </array>
            <key>IE_KEY_WPA_VERSION</key>
            <integer>1</integer>
        </dict>
    </dict>
    <!--
        End of the FIRST record.
        In reality, more records follow.
    -->
</array>
</plist>

The problem that I have is that I want to essentially flatten each observation (only one observation in the example above) so that each element within the first <array> (i.e. each <dict> within <array>) is a row in a data frame and each element within that <dict> is a column, named by the appropriate <key>.

I have experimented with functions in the XML package, mostly xmlToList, but haven't quite been able to figure out the correct way of parsing XML data.

Edit:

My desired output is to more-or-less flatten each record to a row in a data frame or in a list so that values can be easily accessed by key. I don't necessarily care about keeping any hierarchical structure, for example every record will have <key>80211D_IE</key> followed by a dict that contains actual information -- the <key>80211D_IE</key> isn't necessary because it doesn't contain any real information but is just an unnecessary grouping on a set of items. I'm fine either storing this as a list, something like mydata$record1$X80211D_IE$I.E._KEY_80211D_CHAN_INFO_ARRAY$IE_KEY_80211D_FIRST_CHANNEL, or in a data frame like mydata[1, 'I.E._KEY_80211D_FIRST_CHANNEL'].

The biggest problem I have right now is that this XML structure does not seem to lend itself to parsing very well. For example if I want to subset the XML to records where the SSID_STR matches a string, I can't just use xmlToList because it doesn't know that keys should associated with their values. So I get a list like this:

> str(xmlToList("path/to/my/file.xml"), max.level=2)
List of 2
 $ array :List of 25
  ..$ dict:List of 36
  ..$ dict:List of 32
  ..$ dict:List of 32
  ..$ dict:List of 38
  ..$ dict:List of 36
  ..$ dict:List of 34
  ..$ dict:List of 34
  ..$ dict:List of 34
  ..$ dict:List of 34
  ..$ dict:List of 34
  ..$ dict:List of 32
  ..$ dict:List of 38
  ..$ dict:List of 38
  ..$ dict:List of 34
  ..$ dict:List of 36
  ..$ dict:List of 34
  ..$ dict:List of 36
  ..$ dict:List of 34
  ..$ dict:List of 36
  ..$ dict:List of 36
  ..$ dict:List of 40
  ..$ dict:List of 42
  ..$ dict:List of 36
  ..$ dict:List of 38
  ..$ dict:List of 38
 $ .attrs: Named chr "1.0"
  ..- attr(*, "names")= chr "version"

and looking at a single one of these

> str(xmlToList("path/to/my/file.xml")$array[[1]], max.level = 1)
List of 36
 $ key    : chr "80211D_IE"
 $ dict   :List of 4
 $ key    : chr "AGE"
 $ integer: chr "0"
 $ key    : chr "AP_MODE"
 $ integer: chr "2"
 $ key    : chr "BEACON_INT"
 $ integer: chr "100"
 $ key    : chr "BSSID"
 $ string : chr "a:18:a:31:0:83"
 $ key    : chr "CAPABILITIES"
 $ integer: chr "4145"
 $ key    : chr "CHANNEL"
 $ integer: chr "11"
 $ key    : chr "CHANNEL_FLAGS"
 $ integer: chr "10"
 $ key    : chr "HT_CAPS_IE"
 $ dict   :List of 12
 $ key    : chr "HT_IE"
 $ dict   :List of 34
 $ key    : chr "IE"
 $ data   : chr "\n\t\tAAR0ZXN0AQiWlgwSGCQwSAMBCwcGVVMgAQseKgEDMBgBAAAPrAICAAAPrAQA\n\t\tD6wCAQAAD6wCAAAyAmBsRgVzwAEAADMCDAstGowRG///AAAAAAAAAAA"| __truncated__
 $ key    : chr "NOISE"
 $ integer: chr "0"
 $ key    : chr "RATES"
 $ array  :List of 9
 $ key    : chr "RSN_IE"
 $ dict   :List of 8
 $ key    : chr "RSSI"
 $ integer: chr "-86"
 $ key    : chr "SSID"
 $ data   : chr "\n\t\tdGVzdA==\n\t\t"
 $ key    : chr "SSID_STR"
 $ string : chr "test"
 $ key    : chr "WPA_IE"
 $ dict   :List of 8

it is easy to see that there are really only 18 items but the keys are stored as their own items (making 36).

The xmlToList function actually does almost what I want it do to -- but rather to name the elements of the list containing the data using the value of the corresponding key.

This would look like:

List of 18
 $ AGE          : chr "0"
 $ AP_MODE      : chr "2"
 $ BEACON_INT   : chr "100"
 $ BSSID        : chr "a:18:a:31:0:83"
 $ CAPABILITIES : chr "4145"
 $ CHANNEL      : chr "11"
 $ CHANNEL_FLAGS: chr "10"
 $ HT_CAPS_IE   :List of 12
 $ HT_IE        :List of 34
 $ IE           : chr "\n\t\tAAR0ZXN0AQiWlgwSGCQwSAMBCwcGVVMgAQseKgEDMBgBAAAPrAICAAAPrAQA\n\t\tD6wCAQAAD6wCAAAyAmBsRgVzwAEAADMCDAstGowRG///AAAAAAAAAAA"| __truncated__
 $ NOISE        : chr "0"
 $ RATES        :List of 9
 $ RSN_IE       :List of 8
 $ RSSI         : chr "-86"
 $ SSID         : chr "\n\t\tdGVzdA==\n\t\t"
 $ SSID_STR     : chr "test"
 $ WPA_IE       :List of 8
 $ X80211D_IE   :List of 4

In this hypothetical output, it would be easy to get values using the appropriate key. Further it would be easy to continue to unnest lists (since the grouping structure is unnecessary) to produce a data frame.

Upvotes: 0

Answers (2)

jennybryan

Reputation: 2601

New answer after the significant edit to the question.

I stored OP's XML in a file BUT DUPLICATED THE SINGLE RECORD PROVIDED! I'm letting myself use %>% now. I get 16 elements per record where OP gets 18 because the actual XML posted contains no evidence of HT_CAPS_IE and HT_IE. Given the way we're doing this now, it's more about computation on lists than XML, which seems unavoidable. The link between keys and data is more based on adjacency than structure.

library(magrittr)
library(xml2)

## ugly workaround: xml2 does not seem to ignore insignificant whitespace?
x <- "so.xml" %>%
  scan(what = character(), sep = "\n", strip.white = TRUE) %>%
  paste0(collapse = "") %>% 
  read_xml

## isolate each record
(records <- x %>%
  xml_children() %>%
  xml_children())
#> {xml_nodeset (2)}
#> [1] <dict>\n  <key>80211D_IE</key>\n  <dict>\n    <key>IE_KEY_80211D_CHA ...
#> [2] <dict>\n  <key>80211D_IE</key>\n  <dict>\n    <key>IE_KEY_80211D_CHA ...

## turn each record into a list
records_list <- records %>% lapply(as_list)
str(records_list, max.level = 1)
#> List of 2
#>  $ :List of 32
#>  $ :List of 32

## IRL here's where I check that ...
##  we have key, THINGY, key, THINGY, etc. within each record
##  we have THINGY1, THINGY2, etc. across all records

## store item names from record 1
keys <- records_list[[1]][c(TRUE, FALSE)] %>% unlist

## isolate the data, do obvious simplifications, apply item names
jfun <- function(x) if(is.list(x) && length(x) > 1) x else unlist(x)
z <- records_list %>%
  lapply(`[`, c(FALSE, TRUE)) %>% 
  lapply(`names<-`, keys) %>% 
  lapply(lapply, jfun)

## done!
str(z[[1]], max.level = 1)
#> List of 16
#>  $ 80211D_IE    :List of 4
#>  $ AGE          : chr "0"
#>  $ AP_MODE      : chr "2"
#>  $ BEACON_INT   : chr "100"
#>  $ BSSID        : chr "ac:5d:10:73:c3:11"
#>  $ CAPABILITIES : chr "1073"
#>  $ CHANNEL      : chr "2"
#>  $ CHANNEL_FLAGS: chr "10"
#>  $ IE           : chr "AAZPbGl2ZXIBCIKEiwwSlhgkAwECBwZVUyABCxswGAEAAA+sAgIAAA+sBAAPrAIBAAAPrAIAAN0aAFDyAQEAAFDyAgIAAFDyBABQ8gIBAABQ8gIqAQAyBDBIYGw="
#>  $ NOISE        : chr "0"
#>  $ RATES        :List of 12
#>  $ RSN_IE       :List of 8
#>  $ RSSI         : chr "-74"
#>  $ SSID         : chr "T2xpdmVy"
#>  $ SSID_STR     : chr "Oliver"
#>  $ WPA_IE       :List of 8

Upvotes: 1

jennybryan

Reputation: 2601

I stored OP's XML in a file but duplicated the single record that was provided!

This could be slicker using some additional add-on packages (I would use dplyr and the %>%), but I held back. I do advise using xml2 instead of XML. You can use XPATH expressions to target the nodes of interest.

x <- read_xml("so.xml")
(elements <- xml_find_all(x, ".//dict/dict/array/dict"))
#> {xml_nodeset (2)}
#> [1] <dict>\n                    <key>IE_KEY_80211D_FIRST_CHANNEL</key>\n ...
#> [2] <dict>\n                    <key>IE_KEY_80211D_FIRST_CHANNEL</key>\n ...

## isolate the key nodes ... will become variable names
keys <- lapply(elements, xml_find_all, "key")
keys <- lapply(keys, xml_text)
## I advise checking that keys are uniform across the records here!
(keys <- keys[[1]])
#> [1] "IE_KEY_80211D_FIRST_CHANNEL" "IE_KEY_80211D_MAX_POWER"    
#> [3] "IE_KEY_80211D_NUM_CHANNELS"

## isolate integer data
integers <- lapply(y, xml_find_all, "integer")
integers <- lapply(integers, xml_text)
integers <- lapply(integers, type.convert)
yay <- as.data.frame(do.call(rbind, integers))
names(yay) <- keys
yay
#>   IE_KEY_80211D_FIRST_CHANNEL IE_KEY_80211D_MAX_POWER
#> 1                           1                      27
#> 2                           1                      27
#>   IE_KEY_80211D_NUM_CHANNELS
#> 1                         11
#> 2                         11

Upvotes: 3

Parsing XML file with known structure and repeating elements

Answers (2)

Related Questions