Lisa

Reputation: 959

Read HTML into R

I would like R to take each word in a data frame column and return a value from a website. The code I have so far is below. For each word in the column, it should go to the website and return the pronunciation (for example, the pronunciation on http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=word&stress=-s is "W ER1 D"). I have looked at the HTML of the page, and it's unclear what I would need to select to return this value: it sits between <tt> and </tt>, but there are many of these tags. I'm also not sure how to then get that value into R. Thank you.

library(xml2)

for (word in df$word) {
  # build the URL for this word with paste0()
  result <- read_html(paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s"))
}

Upvotes: 2

Views: 12167

Answers (1)

Scientist_jake

Reputation: 251

Parsing HTML in R can be tricky, but there are a couple of ways to do it. If the HTML converts cleanly to XML and the website/API always returns the same structure, you can use XML parsing tools. Otherwise, you could fall back on regex and call stringr::str_extract() on the HTML.
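
As a rough sketch of the regex route: the HTML string below is a made-up fragment, not the real page, and I've used stringr::str_match_all() rather than str_extract() because str_extract() only returns the first match, while here we want the second <tt>.

```r
library(stringr)

# Hypothetical HTML fragment; the real page's markup may differ.
html <- "<p><tt>WORD</tt></p><p><tt>W ER1 D .</tt></p>"

# Capture the text inside every <tt>...</tt> pair (non-greedy match).
# Column 2 of the result matrix holds the first capture group.
tt_values <- str_match_all(html, "<tt>(.*?)</tt>")[[1]][, 2]

# Take the second match, mirroring the "second <tt>" idea.
pronunciation <- tt_values[2]
print(pronunciation)
```

This breaks easily if the markup changes (attributes on the tag, line breaks inside it), which is why the XML tools are usually the safer choice.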

For your case, it is fairly easy to get the value you're looking for using XML tools. It's true that there are a lot of <tt> tags, but the one you want is always the second one, so you can just pull that one out.
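
One thing worth knowing about "the second one" in XPath terms: `//tt[2]` matches any <tt> that is the second <tt> child of its own parent, whereas `(//tt)[2]` matches the second <tt> in the whole document. A small sketch with xml2 (the package from the question) on made-up markup, chosen only so that the two expressions give different answers:

```r
library(xml2)

# Hypothetical markup, not the real page.
doc <- read_html(
  "<html><body><div><tt>a</tt></div><div><tt>b</tt><tt>c</tt></div></body></html>"
)

# //tt[2]: every <tt> that is the second <tt> child of its parent.
second_in_parent <- xml_text(xml_find_all(doc, "//tt[2]"))

# (//tt)[2]: the second <tt> in document order.
second_in_document <- xml_text(xml_find_all(doc, "(//tt)[2]"))

print(second_in_parent)    # "c"
print(second_in_document)  # "b"
```

Which form is right depends on how the page nests its <tt> tags, so it's worth checking the result on a few words before running the whole column.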

#load packages. dplyr is just to use the pipe %>% function
library(httr)
library(XML)
library(dplyr)

#test words
wordlist = c('happy', 'sad')

for (word in wordlist){
  #build the url and GET the result
  url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=",word,"&stress=-s")
  h <- handle(url)
  res <- GET(handle = h)

  #parse the HTML
  resXML <- htmlParse(content(res, as = "text"))

  #retrieve second <tt>
  print(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue))

  #don't abuse your API
  Sys.sleep(0.1)
}

>[1] "HH AE1 P IY0 ."
>[1] "S AE1 D ."
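
The returned strings end in " ." while the question quotes the pronunciation without it. Assuming you don't want the trailing period, a base R sketch for tidying the two strings printed above could look like this:

```r
# The two strings printed by the loop above.
raw <- c("HH AE1 P IY0 .", "S AE1 D .")

# Drop the trailing period, then any whitespace left at the ends.
clean <- trimws(sub("\\.\\s*$", "", raw))

# Optionally split each pronunciation into its individual phonemes.
phonemes <- strsplit(clean, " ", fixed = TRUE)

print(clean)
```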

Good luck!

EDIT: This code will return a dataframe:

#load packages. dplyr is just to use the pipe %>% function
library(httr)
library(XML)
library(dplyr)

#test words
wordlist = c('happy', 'sad')

#initialize the dataframe with pronunciation field
pronunciation_list <- data.frame(pronunciation=character(),stringsAsFactors = FALSE)

#loop over the words
for (word in wordlist){
  #build the url and GET the result
  url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=",word,"&stress=-s")
  h <- handle(url)
  res <- GET(handle = h)

  #parse the HTML
  resXML <- htmlParse(content(res, as = "text"))

  #retrieve second <tt>
  to_add <- data.frame(pronunciation=(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue)), stringsAsFactors = FALSE)

  #bind the data
  pronunciation_list<- rbind(pronunciation_list, to_add)

  #don't abuse your API
  Sys.sleep(0.1)
}
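
If you'd rather stay with xml2 as in the question and get the result next to the original words, something along these lines might work. This is only a sketch: the helper names (`build_url`, `extract_pronunciation`, `lookup_pronunciations`) and the `fetch`/`pause` arguments are my own invention, and it assumes the `//tt[2]` selector from above keeps matching the page.

```r
library(xml2)

# Build the lookup URL for one word (same URL pattern as in the question).
build_url <- function(word) {
  paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=",
         utils::URLencode(word, reserved = TRUE), "&stress=-s")
}

# Pull the second <tt> child out of a parsed page; NA if nothing matches.
extract_pronunciation <- function(page) {
  node <- xml_find_first(page, "//tt[2]")
  if (inherits(node, "xml_missing")) NA_character_ else xml_text(node)
}

# Look up every word and return a data frame with one row per word.
# `fetch` defaults to read_html but can be swapped for offline testing.
lookup_pronunciations <- function(words, fetch = read_html, pause = 0.1) {
  words <- as.character(words)
  pronunciation <- vapply(words, function(word) {
    Sys.sleep(pause)  # don't hammer the server
    extract_pronunciation(fetch(build_url(word)))
  }, character(1), USE.NAMES = FALSE)
  data.frame(word = words, pronunciation = pronunciation,
             stringsAsFactors = FALSE)
}

# Usage against the live site (needs network access):
# df$pronunciation <- lookup_pronunciations(df$word)$pronunciation
```

Collecting the values with vapply() and building the data frame once avoids calling rbind() on every iteration, which gets slow for long word lists.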

Upvotes: 4
