Reputation: 1
I have no idea how to start, so I have no code that I tried, and I apologize. Is there a way to loop over the following URL with a sequence of numbers (years):
where 2021 is replaced by each year in the sequence, and just get the number of search results per year?
Thank you so much!
Edit:
This works for Google Search but not for Google Scholar, where it returns an empty node set.
library(RCurl) # getURL()
library(XML)   # htmlTreeParse(), getNodeSet()
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
url <- "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C22&as_ylo=2021&q=%22causal+inference%22+AND+%22statistics%22&btnG="
doc <- htmlTreeParse(getURL(url, httpheader = list(`User-Agent` = ua)), useInternalNodes = TRUE)
nodes <- getNodeSet(doc, "//div[@id='result-stats']")
nodes
Upvotes: 0
Views: 664
Reputation: 84465
There is an approximate results count below the search bar. Many of the attribute values look dynamic, so (based on experience) I would rely on the relationship between more stable elements and attributes instead. In this case, I would use :contains() to find a div whose text includes "results", and anchor that div with a CSS selector describing its expected position relative to the search form and the elements in between.
library(rvest)
library(httr)
headers <- c("User-Agent" = "Safari/537.36")
r <- httr::GET(url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C22&as_ylo=2021&q=%22TERM1%22+AND+%22TERM2%22&btnG=",
               httr::add_headers(.headers = headers))
r |> content() |> html_element('form[method=post] + div div > div:contains("results")') |> html_text()
You can then use a simple regex to extract the result count, e.g.
library(stringr)
r |>
content() |>
html_element('form[method=post] + div div > div:contains("results")') |>
html_text() |>
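str_extract("[\\d,]+") |> # first number, keeping thousands separators such as "16,600"
str_remove_all(",") |>    # drop the separators so as.integer() sees the whole number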
as.integer()
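To cover the part of the question about looping over years, the request above could be wrapped in a function and applied to a sequence of years. This is only a rough sketch and has not been checked against Google Scholar's current markup. The helper names (`scholar_url()`, `parse_count()`, `count_for_year()`, `counts_by_year()`) are made up for illustration, the `as_yhi` parameter is assumed to cap the year range (the original URL only uses `as_ylo`), and the pause between requests is an arbitrary choice.

```r
library(httr)
library(rvest)
library(stringr)

# Hypothetical helper: build the search URL for a single year.
# Assumption: as_ylo / as_yhi bound the year range (as_yhi is not in the original URL).
scholar_url <- function(year, term1 = "causal inference", term2 = "statistics") {
  query <- URLencode(sprintf('"%s" AND "%s"', term1, term2), reserved = TRUE)
  sprintf("https://scholar.google.com/scholar?hl=en&as_sdt=0%%2C22&as_ylo=%d&as_yhi=%d&q=%s&btnG=",
          year, year, query)
}

# Hypothetical helper: pull the first number out of text such as
# "About 16,600 results (0.05 sec)". Returns NA if no text was found.
parse_count <- function(text) {
  text |>
    str_extract("[\\d,]+") |>
    str_remove_all(",") |>
    as.integer()
}

# Hypothetical helper: fetch one year and return its approximate result count.
# Uses the same selector as above; it may need adjusting if the page layout changes.
count_for_year <- function(year) {
  r <- GET(scholar_url(year), add_headers("User-Agent" = "Safari/537.36"))
  r |>
    content() |>
    html_element('form[method=post] + div div > div:contains("results")') |>
    html_text() |>
    parse_count()
}

# Loop over the years, pausing between requests (the pause length is arbitrary).
counts_by_year <- function(years, pause = 5) {
  counts <- vapply(years, function(y) {
    Sys.sleep(pause)
    count_for_year(y)
  }, integer(1))
  data.frame(year = years, results = counts)
}
```

Calling `counts_by_year(2015:2021)` would then return one row per year. A year whose page does not contain the expected div (for example, if the request is blocked) should come back as NA rather than stopping the loop.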
Upvotes: 1