Scraping https website using getURL

Question

I had a nice little package to scrape Google Ngram data, but I have discovered that Google has switched to SSL and my package has broken. Switching from readLines to getURL gets me some of the way there, but some of the script included in the page is missing. Do I need to get fancy with user agents or something?

Here is what I have tried so far (pretty basic):

library(RCurl)
myurl <- "https://books.google.com/ngrams/graph?content=hacker&year_start=1950&year_end=2000"
getURL(myurl)

Comparing the result with the page source shown in a browser for the same URL reveals that the crucial content is missing from what is returned to R. In the browser, the source includes content looking like this:

juba · Accepted Answer

Sorry, not a direct solution, but it doesn't seem to be a user-agent problem. When you open your URL in a browser, you can see that there is a redirection that adds a parameter at the end of the address: direct_url=t1%3B%2Chacker%3B%2Cc0.

If you use getURL() to download this new URL, complete with the new parameter, then the JavaScript you mention is present in the result.
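As a minimal sketch of that approach, the redirected address can be built by hand and passed to getURL(). It assumes the redirect still appends exactly the direct_url value shown above for this query. The names direct_param, full_url and fetch_page are made up for illustration, and the check for a script tag is only a rough sanity test, not proof that the Ngram data is there.

```r
library(RCurl)

# Original query from the question.
myurl <- "https://books.google.com/ngrams/graph?content=hacker&year_start=1950&year_end=2000"

# Parameter the browser shows after the redirect, copied verbatim
# from the address bar (assumed to be unchanged for this query).
direct_param <- "direct_url=t1%3B%2Chacker%3B%2Cc0"

# Append it as one more query-string parameter.
full_url <- paste(myurl, direct_param, sep = "&")

# fetch_page is a hypothetical helper: it downloads the page and reports
# whether the returned HTML contains any <script tag at all.
fetch_page <- function(url) {
  html <- getURL(url)
  list(html = html, has_script = grepl("<script", html, fixed = TRUE))
}
```

Calling fetch_page(full_url) performs the download. If has_script comes back FALSE, or the data is still missing from html, the redirect parameter has probably changed and should be copied again from the browser.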

Another solution could be to access the data via Google BigQuery, as mentioned in this Stack Overflow question:

Google N-Gram Web API
