Reputation: 301
I've been trying to get into R and thought the best way is to come up with a project that I like and dig into it. So I wanted to analyze my texting habits. I managed to export my texts as an XML file in the following format:
<all>
<message date="1423813836987" number="+15555555" type="1" read="1" locked="1" seen="1">content of text</message>
<message date="1423813836987" number="+15555555" type="1" read="1" locked="1" seen="1">another content of text</message>
</all>
Now, what I would like to do is to extract the attributes "date" and "number" and the content of each message and create a data frame. My end-goal is to create a graph for each "number" and see how often I text that number.
After looking around, I found the XML package for R. I can extract the content of the message, but have not been able to get the attributes from the single message tag. Everything I found regarding attributes talked about nested tags like:
<message>
<date>1423813836987</date>
<number>555-555</number>
</message>
Could anybody point me in the right direction? Is there a better way to do something like this? What I have so far is this:
doc = xmlRoot(xmlTreeParse("~/Desktop/data.xml"))
xml_data <- xmlToList(doc)
But it makes the attributes look funky.
Thank you in advance.
Upvotes: 1
Views: 93
Reputation: 78792
Or a little more succinctly using the Hadleyverse:
library(xml2)
library(dplyr)
# I modified the second record to ensure it was obvious the values carried through
doc <- read_xml('<all>
<message date="1423813836987" number="+15555555" type="1" read="1" locked="1" seen="1">content of text</message>
<message date="1010191919191" number="+22828282" type="2" read="3" locked="0" seen="1">another content of text</message>
</all>')
msgs <- xml_find_all(doc, "//message")
bind_cols(bind_rows(lapply(xml_attrs(msgs), as.list)),
data_frame(message=xml_text(msgs)))
## Source: local data frame [2 x 7]
##
## date number type read locked seen message
## (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 1423813836987 +15555555 1 1 1 1 content of text
## 2 1010191919191 +22828282 2 3 0 1 another content of text
If you don't want to use dplyr:
cbind.data.frame(do.call(rbind, xml_attrs(msgs)),
messages=xml_text(msgs),
stringsAsFactors=FALSE)
## date number type read locked seen messages
## 1 1423813836987 +15555555 1 1 1 1 content of text
## 2 1010191919191 +22828282 2 3 0 1 another content of text
It's also pretty easy to do it with the XML package, too:
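# `doc` above is an xml2 object, so re-parse the same XML with the XML package first
xdoc <- XML::xmlParse(as.character(doc), asText=TRUE)
bind_cols(bind_rows(lapply(XML::xpathApply(xdoc, "//message", XML::xmlAttrs), as.list)),
data_frame(message=XML::xpathSApply(xdoc, "//message", XML::xmlValue)))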
Lather. Rinse. Repeat base example if you want to stick with base.
If there are unneeded columns just use data frame column slicing to get rid of them.
If you want automatic type conversion, you can take a look at type.convert, but the number field is going to be problematic if you need the +.
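As a rough sketch of the last two points (column slicing and type conversion), the base R snippet below works on a hand-built data frame that mirrors the all-character result above. The name msgs_df is made up for illustration, and treating date as milliseconds since the Unix epoch is an assumption based on how the values look, not something the export format confirms.

```r
# Hypothetical stand-in for the data frame built above (every column is character).
msgs_df <- data.frame(
  date    = c("1423813836987", "1010191919191"),
  number  = c("+15555555", "+22828282"),
  type    = c("1", "2"),
  locked  = c("1", "0"),
  message = c("content of text", "another content of text"),
  stringsAsFactors = FALSE
)

# Column slicing: keep only the columns of interest.
msgs_df <- msgs_df[, c("date", "number", "type", "message")]

# Convert selected columns only; `number` stays character so the "+" survives.
msgs_df$type <- type.convert(msgs_df$type, as.is = TRUE)

# Assumption: `date` is milliseconds since 1970-01-01 UTC.
msgs_df$date <- as.POSIXct(as.numeric(msgs_df$date) / 1000,
                           origin = "1970-01-01", tz = "UTC")

str(msgs_df)
```

Converting column by column like this avoids running type.convert over number, which would otherwise parse "+15555555" as a number and drop the +.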
Upvotes: 1
Reputation: 121568
Here is an option using xpathSApply:
library(XML)
## create an XML doc; replace the text with your file name
xx <- htmlParse('
<all>
<message date="1423813836987" number="+15555555" type="1" read="1" locked="1" seen="1">content of text</message>
<message date="1423813836987" number="+15555555" type="1" read="1" locked="1" seen="1">another content of text</message>
</all>',asText=TRUE)
## parsing
data.frame(
date=xpathSApply(xx,'//all/message',xmlGetAttr,'date'),
number=xpathSApply(xx,'//all/message',xmlGetAttr,'number'),
message=xpathSApply(xx,'//all/message',xmlValue))
## date number message
## 1 1423813836987 +15555555 content of text
## 2 1423813836987 +15555555 another content of text
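For the per-number counts the question is ultimately after, one possible next step is to tally rows with base R's table(). This is only a sketch: texts is a hypothetical data frame with a number column like the one produced above, and the three rows are made-up sample values.

```r
# Hypothetical data frame shaped like the result above (sample values only).
texts <- data.frame(
  number  = c("+15555555", "+15555555", "+22828282"),
  message = c("content of text", "another content of text", "third text"),
  stringsAsFactors = FALSE
)

# Count how many messages were exchanged with each number.
counts <- as.data.frame(table(number = texts$number), stringsAsFactors = FALSE)
print(counts)

# A quick bar chart of the counts could then be drawn with, for example:
# barplot(setNames(counts$Freq, counts$number))
```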
Upvotes: 1