Reputation: 959
I would like R to take a word in a column in a dataset, and return a value from a website. The code I have so far is below. So, for each word in the data frame column, it will go to the website and return the pronunciation (for example, the pronunciation on http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=word&stress=-s is "W ER1 D"). I have looked at the HTML of the website, and it's unclear what I would need to enter to return this value - it's between <tt>
and </tt>
but there are many of these. I'm also not sure how to then get that value into R. Thank you.
library(xml2)
for (word in df$word) {
result <- read_html("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in="word"&stress=-s")
}
Upvotes: 2
Views: 12167
Reputation: 251
Parsing HTML is a tricky task in R. There are a couple ways though. If the HTML converts well to XML and the website/API always returns the same structure then you can use tools to parse XML. Otherwise you could use regex and call stringr::str_extract()
on the HTML.
For your case, it is fairly easy to get the value you're looking for using XML tools. It's true that there are a lot of <tt>
tags but the one you want is always in the second instance so you can just pull out that one.
#load packages. dplyr is just to use the pipe %>% function
library(httr)
library(XML)
library(dplyr)
#test words
wordlist = c('happy', 'sad')
for (word in wordlist){
#build the url and GET the result
url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=",word,"&stress=-s")
h <- handle(url)
res <- GET(handle = h)
#parse the HTML
resXML <- htmlParse(content(res, as = "text"))
#retrieve second <tt>
print(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue))
#don't abuse your API
Sys.sleep(0.1)
}
>[1] "HH AE1 P IY0 ."
>[1] "S AE1 D ."
Good luck!
EDIT: This code will return a dataframe:
#load packages. dplyr is just to use the pipe %>% function
library(httr)
library(XML)
library(dplyr)
#test words
wordlist = c('happy', 'sad')
#initializae the dataframe with pronunciation field
pronunciation_list <- data.frame(pronunciation=character(),stringsAsFactors = F)
#loop over the words
for (word in wordlist){
#build the url and GET the result
url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=",word,"&stress=-s")
h <- handle(url)
res <- GET(handle = h)
#parse the HTML
resXML <- htmlParse(content(res, as = "text"))
#retrieve second <tt>
to_add <- data.frame(pronunciation=(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue)))
#bind the data
pronunciation_list<- rbind(pronunciation_list, to_add)
#don't abuse your API
Sys.sleep(0.1)
}
Upvotes: 4