natalieee

Reputation: 67

Scraping/accessing all search results from input field

I'd like to scrape https://www.deutsche-biographie.de/ using rvest. A name must be entered in the input field at the top of the page; the search results then list all people with that or a similar name.

For example, I entered the name 'Meier' and scraped the corresponding search results using the following code.

library(rvest)
library(dplyr)

page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)
result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
result <- tidyr::unnest_wider(result, information) %>%
  rename(years = 2, profession = 3) %>% 
  tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")

places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")

result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)]) %>% unlist()
result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)]) %>% unlist()
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])

result <- result %>% 
  tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>% 
  tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")

result

The URL used here is "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier", where name=meier is the name I entered manually. Is there a way to access all names/search results without specifying one particular name? I am very grateful for any hints!

Update (solution): As suggested by @QHarr, I added a for-loop that iterates over all result pages:

    for (page_result in seq(from = 0, to = 2368)) {  # number= is zero-indexed: number=0 is page 1
      link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                    page_result)
    ...}

The entire code is then as follows:

library(rvest)
library(dplyr)

result_total = data.frame()

for (page_result in seq(from = 0, to = 2368)) {  # number= is zero-indexed: number=0 is page 1
  link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                page_result)
  
  # Download the page to a local file and parse that file; parsing the URL directly
  # sometimes fails with 'Error in open.connection(x, "rb"): Timeout was reached'
  download.file(link, destfile = "scrapedpage.html", quiet = TRUE)
  page = read_html("scrapedpage.html")
  name = page %>% html_nodes(".media-heading a") %>% html_text()
  information = page %>% html_nodes("div.media-body p") %>% html_text()
  result = data.frame(name, information)
  result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
  result <- tidyr::unnest_wider(result, information) %>%
    rename(years = 2, profession = 3) %>% 
    tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")
  
  places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
  
  result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)]) %>% unlist()
  result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)]) %>% unlist()
  result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])
  
  result <- result %>% 
    tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>% 
    tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")
  
  print(paste("Page:", page_result)) #track the page that R is currently looping over
  result_total <- rbind(result_total, result)
}


result_total <- apply(result_total, 2, as.character)  # convert all columns (including list columns) to character

Upvotes: 0

Views: 360

Answers (1)

QHarr

Reputation: 84465

Use the wildcard "*" to match all names. You will still need to retrieve the results page by page, however:

https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*

You can get the total results count from the initial request. Given that results come in batches of 10 and that pagination is reflected in the URL via the number parameter, issue a request for each page needed to cover that total. A single page looks like:

Page 1:

https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=0

....

Page 11:

https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=10
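Putting this together, the full set of page URLs can be generated from the total hit count. A minimal sketch, assuming the total is 23684 (an illustrative placeholder; read the real value from the first results page) and omitting the session-specific `_csrf` token:

```r
# Results arrive in batches of 10; `number` is the zero-based page index.
total_results <- 23684  # placeholder: read this from the first results page
n_pages <- ceiling(total_results / 10)

urls <- paste0(
  "https://www.deutsche-biographie.de/search?name=*&number=",
  seq_len(n_pages) - 1   # number=0 is page 1, number=10 is page 11, ...
)

urls[1]   # ends in number=0
urls[11]  # ends in number=10
```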


Issue the requests in parallel and gather the results. Consider polite wait times depending on the total number of requests required.
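A sequential loop with a fixed delay is the simplest way to stay polite; the sketch below also wraps each request in `tryCatch` so one failed page does not abort the whole run. The `fetch_page` helper and the one-second pause are my own choices, not part of the site's API:

```r
library(rvest)

# Fetch one results page politely; returns NULL on failure instead of stopping.
fetch_page <- function(url, pause = 1) {
  Sys.sleep(pause)                                   # polite wait between requests
  tryCatch(read_html(url), error = function(e) NULL)
}

# pages <- lapply(urls, fetch_page)   # then parse each non-NULL page as above
```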

Upvotes: 1
