Reputation: 67
I'd like to scrape https://www.deutsche-biographie.de/ using rvest. In the input field at the top of this webpage, a name must be entered; the corresponding search results then show all people who have this or a similar name.
For example, I entered the name 'Meier' and scraped the corresponding search results with the following code:
library(rvest)
library(dplyr)

# Search results for the name 'Meier' (the _csrf token comes from the session)
page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")

# Extract the person names and the descriptive text of each hit
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)

# Split the description into its parts, then spread them into columns
result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
result <- tidyr::unnest_wider(result, information) %>%
  rename(years = 2, profession = 3) %>%
  tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")

# The data-orte attribute holds places as "name@coordinates@type" entries separated by ";"
places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)]) %>% unlist()
result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)]) %>% unlist()
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])

# Separate the place names from their coordinates
result <- result %>%
  tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>%
  tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")
result
The URL used here is "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier", with name=meier being the name that I entered manually. Is there a way to access all the names/search results without having to specify one certain name?
I am very grateful for any hint you may have!
Update (solution): As suggested by @QHarr, I inserted a for-loop that iterates over all result pages:
for (page_result in seq(from = 1, to = 2369)) {
  link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                page_result)
  ...
}
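The upper bound of 2369 pages is hard-coded here. As a minimal sketch, the bound could instead be derived from the hit count reported on the first results page; note that the ".result-count" selector below is an assumption about the page markup, not something taken from the site, so it would need to be adjusted to whatever element actually holds the count:

library(rvest)

first_page <- read_html("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*")
total_hits <- first_page %>%
  html_node(".result-count") %>%   # hypothetical selector for the hit count
  html_text() %>%
  gsub("[^0-9]", "", .) %>%
  as.numeric()
n_pages <- ceiling(total_hits / 10)  # results come in batches of 10

The loop could then run over seq(0, n_pages - 1) instead of a fixed range.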
So the entire code is as follows:
result_total = data.frame()

for (page_result in seq(from = 1, to = 2369)) {
  link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                page_result)

  # Download the page to disk and parse the local copy; reading the URL directly
  # sometimes fails with 'error in open.connection(x, "rb") : Timeout was reached'
  download.file(link, destfile = "scrapedpage.html", quiet = TRUE)
  page = read_html("scrapedpage.html")

  name = page %>% html_nodes(".media-heading a") %>% html_text()
  information = page %>% html_nodes("div.media-body p") %>% html_text()
  result = data.frame(name, information)

  result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
  result <- tidyr::unnest_wider(result, information) %>%
    rename(years = 2, profession = 3) %>%
    tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")

  places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
  result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)]) %>% unlist()
  result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)]) %>% unlist()
  result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])

  result <- result %>%
    tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>%
    tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")

  print(paste("Page:", page_result))  # track the page that R is currently looping over
  result_total <- rbind(result_total, result)
}

# Flatten the remaining list column (place_of_activity) to character for export
result_total <- apply(result_total, 2, as.character)
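Since the loop issues a few thousand requests, a small retry helper with a pause between attempts can make it more robust against the timeout mentioned above. This is a sketch, not part of the original code; the three-attempt limit and two-second pause are arbitrary choices:

# Read a URL with retries and a polite pause between attempts (sketch)
read_html_safely <- function(url, attempts = 3, pause = 2) {
  for (i in seq_len(attempts)) {
    page <- tryCatch(read_html(url), error = function(e) NULL)
    if (!is.null(page)) return(page)
    Sys.sleep(pause)  # wait before retrying
  }
  stop("Failed to read ", url, " after ", attempts, " attempts")
}

# Usage inside the loop: page = read_html_safely(link)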
Upvotes: 0
Views: 360
Reputation: 84465
Use the "*" operator for all. You will still need to retrieve results by page however
https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*
You can get the total result count from the initial request; given that results come in batches of 10 and that the pagination is reflected in the URL's number parameter, you can then issue requests for all the pages needed to cover that total (see the sketch after the examples below). A single page looks like:
Page 1:
https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=0
...
Page 11:
https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=10
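Putting that arithmetic into code, a minimal sketch (the total of 23690 hits is a made-up placeholder; derive the real number from the first response):

total_hits <- 23690                      # placeholder: read this from the first results page
n_pages    <- ceiling(total_hits / 10)   # 10 results per page
base       <- "https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number="
urls       <- paste0(base, seq(0, n_pages - 1))  # number is the zero-based page index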
Issue the requests in parallel and gather the results. Consider polite wait times, depending on the total number of requests required.
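As a sketch of the parallel-plus-polite-wait idea (the half-second pause and four workers are arbitrary choices; parallel::mclapply forks only on Unix-alikes, so on Windows use mc.cores = 1 or a PSOCK cluster instead):

library(rvest)
library(parallel)

# Fetch one results page and return its names and descriptions
fetch_page <- function(url) {
  Sys.sleep(0.5)  # polite pause per request
  page <- read_html(url)
  data.frame(
    name = page %>% html_nodes(".media-heading a") %>% html_text(),
    information = page %>% html_nodes("div.media-body p") %>% html_text()
  )
}

# 'urls' as built in the sketch above; gather all pages into one data frame
results <- mclapply(urls, fetch_page, mc.cores = 4)
result_total <- do.call(rbind, results)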
Upvotes: 1