Justas Mundeikis

Reputation: 995

R extracting structured data between multiple html tags

I have downloaded my Facebook data. It contains an .htm file with all my contacts. I would like to read it into R and create a contact.csv.

The usual structure is:

<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: firstname.lastname@example.com</li><li>contact: +123456789</li></ul></span></td></tr>

but some contacts are missing the phone number:

<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: firstname.lastname@example.com</li></ul></span></td></tr>

while others are missing the email:

<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: +123456789</li></ul></span></td></tr>

The CSV should have the structure: Firstname Lastname; email; tel number

I have tried:

library(rvest)
library(stringr)

html <- read_html("contact_info.htm")
p_nodes <- html %>% html_nodes('tr')
p_nodes_text <- p_nodes %>% html_text()
write.csv(p_nodes_text, "contact.csv")

This creates the CSV, but unfortunately it merges the names with "contact:", does not create separate columns, and does not allow "NA" for missing phone numbers or emails.

How could I enhance my code to accomplish this? Thanks

Upvotes: 2

Views: 60

Answers (1)

Guillaume Ottavianoni

Reputation: 496

You can use regular expressions (here via grep) to identify the email address and the telephone number:

# Sample rows in the structure from the question (the email address is a placeholder)
xml1 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: firstname.lastname@example.com</li><li>contact: +123456789</li></ul></span></td></tr>'
xml2 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: firstname.lastname@example.com</li></ul></span></td></tr>'
xml3 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: +123456789</li></ul></span></td></tr>'
docs <- c(xml1, xml2, xml3)

library(rvest)

df <- NULL

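for (doc in docs) {
  page <- read_html(doc)
  name <- page %>% html_nodes("tr td:first-child") %>% html_text()
  meta <- page %>% html_nodes("span.meta li") %>% html_text()
  # Strip the "contact: " prefix so only the value itself is kept
  meta <- sub("^contact:\\s*", "", meta)
  # An entry of the form x@y.z is treated as the email address
  ind_mail <- grep(".+@.+\\..+", meta)
  mail <- if (length(ind_mail) > 0) meta[ind_mail[1]] else NA_character_
  # An entry ending in six or more digits is treated as the phone number
  ind_tel <- grep("[0-9]{6,}$", meta)
  tel <- if (length(ind_tel) > 0) meta[ind_tel[1]] else NA_character_
  # Missing values become NA, as asked for in the question
  res <- data.frame(name, mail, tel, stringsAsFactors = FALSE)
  df <- rbind(df, res)
}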
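
To apply the same idea to the whole export instead of a few sample strings, one option is to work on the tr nodes of a single parsed document and write the result with write.csv2, which uses ";" as the separator. This is only a sketch under some assumptions: extract_contact is a hypothetical helper name, the sample HTML stands in for the real file (for the actual export you would call read_html("contact_info.htm") as in the question), the email address is a placeholder, and every row is assumed to follow the structure shown in the question.

```r
library(rvest)

# Hypothetical helper: turn one <tr> node into a one-row data frame.
extract_contact <- function(row) {
  # The first <td> of the row holds the name
  name <- row %>% html_node("td") %>% html_text()
  # Each <li> holds one "contact: ..." entry; drop the prefix
  meta <- row %>% html_nodes("li") %>% html_text()
  meta <- sub("^contact:\\s*", "", meta)
  # Entries containing "@" are taken as emails, digit-only entries as phone numbers
  mail <- meta[grepl("@", meta, fixed = TRUE)]
  tel <- meta[grepl("^\\+?[0-9]{6,}$", meta)]
  data.frame(
    name = name,
    email = if (length(mail) > 0) mail[1] else NA_character_,
    tel = if (length(tel) > 0) tel[1] else NA_character_,
    stringsAsFactors = FALSE
  )
}

# Stand-in for the real file: three rows wrapped in a <table>
sample_html <- paste0(
  "<table>",
  "<tr><td>Firstname Lastname</td><td><span class=\"meta\"><ul>",
  "<li>contact: firstname.lastname@example.com</li>",
  "<li>contact: +123456789</li></ul></span></td></tr>",
  "<tr><td>Firstname Lastname</td><td><span class=\"meta\"><ul>",
  "<li>contact: firstname.lastname@example.com</li></ul></span></td></tr>",
  "<tr><td>Firstname Lastname</td><td><span class=\"meta\"><ul>",
  "<li>contact: +123456789</li></ul></span></td></tr>",
  "</table>"
)

rows <- read_html(sample_html) %>% html_nodes("tr")
contacts <- do.call(rbind, lapply(rows, extract_contact))

# Semicolon-separated output; missing values are written as NA
write.csv2(contacts, "contact.csv", row.names = FALSE)
```

Building one data frame per row and binding them at the end avoids growing df inside a loop, and keeping the matching rules in one helper makes it easier to tighten the email or phone patterns if the real data turns out to be messier.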

Hope that helps,

Gottavianoni

Upvotes: 1
