Reputation: 995
I have downloaded my Facebook data. It contains an .htm file with all my contacts. I would like to read it into R and create a contact.csv.
The usual structure is:
<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: firstname.lastname@example.com</li><li>contact: +123456789</li></ul></span></td></tr>
but some contacts are missing the phone number:
<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: firstname.lastname@example.com</li></ul></span></td></tr>
while others are missing the email:
<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: +123456789</li></ul></span></td></tr>
The CSV should have the structure: Firstname Lastname; email; phone number
I have tried:
library(rvest)
library(stringr)
html <- read_html("contact_info.htm")
p_nodes <- html %>% html_nodes('tr')
p_nodes_text <- p_nodes %>% html_text()
write.csv(p_nodes_text, "contact.csv")
This creates the CSV, but it merges each name with the "contact:" text, does not create separate columns, and cannot put "NA" where a phone number or email is missing.
How could I enhance my code to accomplish this? Thanks
Upvotes: 2
Views: 60
Reputation: 496
You can use regular expressions (with grep) to identify the email and the telephone number:
library(rvest)
# Sample rows: both fields, email only, phone only
xml1 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: firstname.lastname@example.com</li><li>contact: +123456789</li></ul></span></td></tr>'
xml2 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: firstname.lastname@example.com</li></ul></span></td></tr>'
xml3 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: +123456789</li></ul></span></td></tr>'
docs <- c(xml1, xml2, xml3)
df <- NULL
for (doc in docs) {
  page <- read_html(doc)
  name <- page %>% html_nodes("tr td:first-child") %>% html_text()
  # Text of each <li>, with the leading "contact: " label removed
  meta <- page %>% html_nodes("span.meta li") %>% html_text()
  meta <- sub("^contact:\\s*", "", meta)
  # An email has an "@" followed later by a dot; NA when absent
  ind_mail <- grep(".+@.+\\..+", meta)
  mail <- if (length(ind_mail) > 0) meta[ind_mail[1]] else NA_character_
  # A phone number ends in at least six digits; NA when absent
  ind_tel <- grep("[0-9]{6,}$", meta)
  tel <- if (length(ind_tel) > 0) meta[ind_tel[1]] else NA_character_
  # One row per contact, with separate name, mail and tel columns
  df <- rbind(df, data.frame(name, mail, tel, stringsAsFactors = FALSE))
}
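The same idea could be applied to the whole file in one pass. The following is only a sketch: the helper name parse_contact, the inline sample table and the example.com address are assumptions made for illustration, and it assumes each contact sits in its own <tr> with the name in the first <td>. For the real export, the sample string would be replaced by the path to contact_info.htm.

```r
library(rvest)

# Inline sample standing in for contact_info.htm (assumption: rows sit in a <table>)
sample_html <- paste0(
  '<table>',
  '<tr><td>Firstname Lastname</td><td><span class="meta"><ul>',
  '<li>contact: firstname.lastname@example.com</li>',
  '<li>contact: +123456789</li></ul></span></td></tr>',
  '<tr><td>Firstname Lastname</td><td><span class="meta"><ul>',
  '<li>contact: firstname.lastname@example.com</li></ul></span></td></tr>',
  '<tr><td>Firstname Lastname</td><td><span class="meta"><ul>',
  '<li>contact: +123456789</li></ul></span></td></tr>',
  '</table>'
)

# Hypothetical helper: turn one <tr> node into a one-row data frame
parse_contact <- function(row) {
  name <- html_text(html_node(row, "td"))   # first <td> holds the name
  meta <- html_text(html_nodes(row, "li"))  # zero or more "contact: ..." items
  meta <- sub("^contact:\\s*", "", meta)    # drop the "contact: " label
  mail <- meta[grepl("@", meta, fixed = TRUE)]
  tel  <- meta[grepl("[0-9]{6,}$", meta)]
  data.frame(
    name = name,
    mail = if (length(mail) > 0) mail[1] else NA_character_,
    tel  = if (length(tel) > 0) tel[1] else NA_character_,
    stringsAsFactors = FALSE
  )
}

rows <- html_nodes(read_html(sample_html), "tr")
contacts <- do.call(rbind, lapply(rows, parse_contact))

# write.csv2 uses ";" as the separator; missing values are written as NA
write.csv2(contacts, "contact.csv", row.names = FALSE)
```

Working row by row keeps a missing email or phone number tied to the right contact, which a single html_text() call over all rows cannot do.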
Hope that helps,
Gottavianoni
Upvotes: 1