Reputation: 890
In R I would like to get the top 10 search terms from Google Trends for a given category. For example, the top 10 search terms for the category automotive are included in this URL:
url <- "https://www.google.com/trends/explore#cat=0-47&geo=US&cmpt=q&tz=Etc%2FGMT-1"
To retrieve the search terms I tried the following:
library("rvest")
top_searches <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="trends-bar-chart-name"]') %>%
  html_table()
This code, however, yields an empty list (note that I used SelectorGadget to figure out the XPath).
Upvotes: 3
Views: 696
Reputation: 4275
This is what you need:
library("rvest")
url <- 'http://www.google.com/trends/fetchComponent?hl=pl&cat=0-47&geo=US&cmpt=q&tz=Etc/GMT-1&tz=Etc/GMT-1&content=1&cid=TOP_ENTITIES_0_0&export=5&w=300&h=420'
top_searches <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="trends-bar-chart-name"]') %>%
  html_text(trim = TRUE)
# [1] "Car - Transportation mode" "Sales - Industry"
# [3] "Chevrolet - Automobile Company" "Ford - Automobile Make"
# [5] "Tire - Industry" "Craigslist Inc. - Advertising company"
# [7] "Truck - Truck" "Engine - Literature Subject"
# [9] "Kelley Blue Book - Company" "Toyota - Automobile Make"
Read on if you are interested in why your approach didn't work and how I managed to solve the issue.
The problem is that what you are looking for is not in the xml_document object. The data you want is loaded dynamically, and rvest cannot cope with that: it can only fetch the page source and extract whatever is already there, without any client-side processing. As the author of rvest stated, in cases like this you must either "reverse engineer the communications protocol and request the raw data directly from the server" or "use a package like RSelenium to automate a web browser".
Fortunately, the first solution proved to be relatively easy.
On the Google page you linked to, right below the chart you are interested in, there is a small icon: </>. Clicking it gives you an HTML snippet that can be used to embed the chart on your own website.
This snippet executes JavaScript code that creates an <iframe> element displaying the content of http://www.google.com/trends/...&export=5&w=300&h=420. As it turns out, that page contains the data you are after.
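Since the category code is just a query parameter in that embed URL, you can assemble the URL programmatically and swap in other categories. A minimal sketch in base R; the parameter names are simply the ones visible in the URL above, and none of them are documented by Google:

```r
# Build the chart-export URL for a given Google Trends category code
# (e.g. "0-47" for automotive). These query parameters are undocumented;
# they are copied verbatim from the embed URL shown above.
build_trends_url <- function(cat, geo = "US") {
  sprintf(
    "http://www.google.com/trends/fetchComponent?hl=pl&cat=%s&geo=%s&cmpt=q&tz=Etc/GMT-1&content=1&cid=TOP_ENTITIES_0_0&export=5&w=300&h=420",
    cat, geo
  )
}

build_trends_url("0-47")
```

The resulting string can then be piped into read_html() exactly as in the code above.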
However, you should realize that Google chose to publish only this embed snippet, and you should be fully aware of the consequences.
First, there are no promises further down the road. The HTML behind the </> icon will keep working until Google decides to shut down Trends embedding, because they must support sites that used this snippet and then forgot about the whole thing. But the content of the script that is called, the URL of the embedded page, or its HTML structure may change whenever Google feels like it. The code above might stop working tomorrow.
Second, Google decided that they don't want people calling this URL directly. You can do it, although common courtesy says you shouldn't. If you decide to do it anyway, don't abuse it; what counts as "abuse" is anyone's guess.
Back to the R code: I called the html_text() function instead of html_table(), because html_nodes() returns a list of <span> elements, not a <table> element.
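You can see the difference on a self-contained snippet, without touching Google at all. The HTML below is made up for illustration; it just mimics the <span class="trends-bar-chart-name"> structure of the real page:

```r
library(rvest)

# A stand-in for the real page: bare <span> elements, no <table> anywhere.
snippet <- '<div>
  <span class="trends-bar-chart-name">Car - Transportation mode</span>
  <span class="trends-bar-chart-name">Sales - Industry</span>
</div>'

nodes <- read_html(snippet) %>%
  html_nodes(xpath = '//*[@class="trends-bar-chart-name"]')

html_text(nodes, trim = TRUE)
```

In your original code the XPath matched nothing at all (the spans are injected by JavaScript after the page loads), so html_table() received an empty node set and quietly returned an empty list.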
Upvotes: 5