Re-l

Reputation: 301

XML parsing of attributes using R

I've been trying to get into R and thought the best way is to come up with a project that I like and dig into it. So I wanted to analyze my texting habits. I managed to export my texts as an XML file in the following format:

<all>
    <message date="1423813836987" number="+15555555" type="1" read="1" locked="1" seen="1">content of text</message>
    <message date="1423813836987" number="+15555555" type="1" read="1" locked="1" seen="1">another content of text</message>
</all>

Now, what I would like to do is to extract the attributes "date" and "number" and the content of each message and create a data frame. My end-goal is to create a graph for each "number" and see how often I text that number.

After looking around, I found the XML package for R. I can extract the content of the message, but have not been able to get the attributes from the single message tag. Everything I found regarding attributes talked about nested tags like:

<message>
    <date>1423813836987</date>
    <number>555-555</number>
</message>

Could anybody point me in the right direction? Is there a better way to do something like this? What I have so far is this:

library(XML)
doc <- xmlRoot(xmlTreeParse("~/Desktop/data.xml"))
xml_data <- xmlToList(doc)

But it makes the attributes look funky.

[Image: how the data looks in RStudio]

Thank you in advance.

Upvotes: 1

Views: 93

Answers (2)

hrbrmstr

Reputation: 78792

Or a little more succinctly using the Hadleyverse:

library(xml2)
library(dplyr)

# I modified the second record to ensure it was obvious the values carried through

doc <- read_xml('<all>
    <message date="1423813836987" number="+15555555" type="1" read="1" locked="1" seen="1">content of text</message>
    <message date="1010191919191" number="+22828282" type="2" read="3" locked="0" seen="1">another content of text</message>
</all>')

msgs <- xml_find_all(doc, "//message")

bind_cols(bind_rows(lapply(xml_attrs(msgs), as.list)),
          data_frame(message=xml_text(msgs)))

## Source: local data frame [2 x 7]
## 
##            date    number  type  read locked  seen                 message
##           (chr)     (chr) (chr) (chr)  (chr) (chr)                   (chr)
## 1 1423813836987 +15555555     1     1      1     1         content of text
## 2 1010191919191 +22828282     2     3      0     1 another content of text

If you don't want to use dplyr:

cbind.data.frame(do.call(rbind, xml_attrs(msgs)),
                 messages=xml_text(msgs),
                 stringsAsFactors=FALSE)

##            date    number type read locked seen                messages
## 1 1423813836987 +15555555    1    1      1    1         content of text
## 2 1010191919191 +22828282    2    3      0    1 another content of text

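It's pretty easy to do with the XML package, too, as long as the document is parsed by XML rather than xml2:

library(XML)
doc <- xmlParse("~/Desktop/data.xml")  # re-parse with XML; xpathApply() does not accept xml2 objects
bind_cols(bind_rows(lapply(xpathApply(doc, "//message", xmlAttrs), as.list)),
          data_frame(message=xpathSApply(doc, "//message", xmlValue)))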

Lather, rinse, repeat: the base example above works the same way if you want to stick with base R.

If there are unneeded columns, just use data frame column slicing to get rid of them.

If you want automatic type conversion, take a look at type.convert, but the number field is going to be problematic if you need to keep the leading +.
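
A minimal sketch of both suggestions, assuming a character-only data frame shaped like the output above. The names msgs_df and convert are made up for illustration, and the data frame is typed in by hand so the snippet runs without any XML package. type.convert is applied per column, skipping number and message so the + is not lost:

```r
# Character-only data frame mirroring the parsed output shown above
msgs_df <- data.frame(
  date    = c("1423813836987", "1010191919191"),
  number  = c("+15555555", "+22828282"),
  type    = c("1", "2"),
  read    = c("1", "3"),
  locked  = c("1", "0"),
  seen    = c("1", "1"),
  message = c("content of text", "another content of text"),
  stringsAsFactors = FALSE
)

# Column slicing: keep only the columns of interest
msgs_df <- msgs_df[, c("date", "number", "type", "message")]

# Convert every remaining column except number and message,
# so the leading "+" in number stays intact
convert <- setdiff(names(msgs_df), c("number", "message"))
msgs_df[convert] <- lapply(msgs_df[convert], type.convert, as.is = TRUE)

str(msgs_df)
```

Leaving number as character is a deliberate choice here: converting it would turn "+15555555" into a plain number and drop the +.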

Upvotes: 1

agstudy

Reputation: 121568

Here is an option using xpathSApply:

library(XML)

## create an XML doc from text; to read your file instead, pass its name and drop asText
xx <- htmlParse('
<all>
    <message date="1423813836987" number="+15555555" type="1" read="1" locked="1" seen="1">content of text</message>
    <message date="1423813836987" number="+15555555" type="1" read="1" locked="1" seen="1">another content of text</message>
</all>', asText=TRUE)
## extract the two attributes and the text of each message
data.frame(
  date=xpathSApply(xx,'//all/message',xmlGetAttr,'date'),
  number=xpathSApply(xx,'//all/message',xmlGetAttr,'number'),
  message=xpathSApply(xx,'//all/message',xmlValue),
  stringsAsFactors=FALSE)

##           date    number                 message
## 1 1423813836987 +15555555         content of text
## 2 1423813836987 +15555555 another content of text
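
From a data frame like this, the per-number counts the question asks about can be sketched in base R. This is only an illustration: the data frame is retyped by hand from the output above, the names texts, when and counts are made up, and reading date as milliseconds since the Unix epoch is an assumption based on how the values look, not something the question confirms.

```r
# Data frame shaped like the output above (values copied from it)
texts <- data.frame(
  date    = c("1423813836987", "1423813836987"),
  number  = c("+15555555", "+15555555"),
  message = c("content of text", "another content of text"),
  stringsAsFactors = FALSE
)

# Assumption: date holds milliseconds since 1970-01-01 UTC
texts$when <- as.POSIXct(as.numeric(texts$date) / 1000,
                         origin = "1970-01-01", tz = "UTC")

# Number of messages per phone number, most frequent first
counts <- sort(table(texts$number), decreasing = TRUE)
print(counts)
```

The counts table can be passed straight to barplot() to get one bar per number.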

Upvotes: 1
