I'm trying to scrape data from a website; let me explain the idea. On this URL, https://www.whosampled.com/Daft-Punk/sampled/?role=1, there are several songs by Daft Punk. The information I need lives inside each individual track page, and I need to extract it for all of the tracks. For example, the first one is "Harder, Better, Faster, Stronger":
https://www.whosampled.com/Daft-Punk/Harder,-Better,-Faster,-Stronger/sampled/
In each track there is information about who has sampled that Daft Punk track: the artist, the track where the sample was used, the date, and the type. I need this information for every track produced by Daft Punk. So far I have written the code below; the comments mark the various steps (note that some commented-out lines are alternative attempts).
library(tidyverse)
library(rvest)
library(httr)
# setwd for the user-agent file; I will upload it online when I finish
setwd("C:/Users/c_ans/Desktop/bologna lezioni/comunication of statistics")
ualist <- readLines("useragents.txt") # one user-agent string per line
# create a list of links to all the possible songs
# page <- read_html("https://www.whosampled.com/Daft-Punk/sampled/?role=1", user_agent=sample(ualist))
page <- read_html("https://www.whosampled.com/Daft-Punk/sampled/?role=1")
tracks_links <- page %>%
  html_nodes(".trackCover") %>%
  html_attr("href") %>% # href holds the relative URL of each song
  paste0("https://www.whosampled.com", ., "sampled/")
# first method: loop over each link
results <- vector("list", length(tracks_links)) # container for the per-track tables
for (i in 1:length(tracks_links)) {
  Sys.sleep(5)
  page <- read_html(tracks_links[i], user_agent = sample(ualist))
  pages <- page %>%
    html_elements(".page a") %>%
    html_text2() %>%
    last()
  # extract the data from each page
  results[[i]] <- paste0(tracks_links[i], "?cp=", 1:pages) %>%
    # map(~read_html(., user_agent = sample(ualist))) # i think the problem is here, i would like to apply a user agent each time the loop is executed
    map(read_html) %>%
    map_dfr(~ html_elements(.x, ".table.tdata tbody tr") %>%
      map_dfr(~ tibble(
        title = html_element(.x, ".trackName.playIcon") %>% html_text2(),
        artist = html_element(.x, ".tdata__td3") %>% html_text2(),
        year = html_element(.x, ".tdata__td3:nth-child(4)") %>% html_text2(),
        genre = html_element(.x, ".tdata__badge") %>% html_text2()
      ))) # correct way to store the tibble?
}
The problems are as follows:
I encounter the error Error in open.connection(x, "rb") : HTTP error 403 when I run the loop. I believe the site has some anti-scraping mechanism that blocks repeated requests. That is why I set a user agent; I know the approach occasionally works, because I managed to extract the song URLs in the first part of the code.
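For what it's worth, as far as I can tell read_html() silently ignores a user_agent argument, so the header may never actually be sent. A minimal sketch of sending one explicitly with httr (GET(), user_agent(), stop_for_status(), and content() are real httr functions; fetch_page() is a helper name of my own):

```r
library(httr)
library(rvest)

# Sketch, assuming the 403 comes from a missing/blocked User-Agent header.
fetch_page <- function(url, ua_list) {
  ua <- sample(ua_list, 1)   # pick ONE agent; sample(ua_list) alone returns the whole list shuffled
  resp <- GET(url, user_agent(ua))
  stop_for_status(resp)      # turns a 403 into a clear R error instead of a parse failure
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# usage (not run here):
# page <- fetch_page(tracks_links[1], readLines("useragents.txt"))
```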
I'm not sure the loop itself is correct, since it never completes. The code inside should be fine, though: I tested it with a single URL instead of the list and extracted what I needed. I'm not convinced about the following part, however:
paste0(tracks_links[i], "?cp=", 1:pages) %>% map(~read_html(.,user_agent=sample(ualist)))
I would like the html-reading function, with a user agent, to be applied to each individual link inside the loop. However, I'm not sure I've written this correctly, and I wouldn't want that to be the issue.
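Here is a sketch of the per-link mapping I have in mind, with a stand-in fetch() so the sampling logic is visible on its own (ua_list, fetch(), and the URLs are placeholders, not the real request code):

```r
library(purrr)

# Placeholder data: real code would read agents from useragents.txt, and
# fetch() would perform the actual HTTP request.
ua_list <- c("agent-one", "agent-two", "agent-three")
fetch <- function(url, ua) paste(url, "requested as", ua)

urls <- paste0("https://www.whosampled.com/some-track/sampled/?cp=", 1:3)
# the lambda re-evaluates sample() on every call, so each request gets a fresh agent
pages <- map(urls, ~ fetch(.x, sample(ua_list, 1)))
length(pages) # 3
```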
In summary, every time I run the loop, I encounter the 403 error, which prevents me from understanding where the issue in the code might be.
Here is a short, readable solution: create a list of URLs, loop through each page, collect a list of results, and merge them into one final data frame.
This pages through the first 5 result pages of one track:
library(dplyr)
library(rvest)

baseURL <- "https://www.whosampled.com/Daft-Punk/Harder,-Better,-Faster,-Stronger/sampled/?cp="
urls <- paste0(baseURL, 1:5)

# loop through the urls; the output is a list of data frames
dfs <- lapply(urls, function(url){
  # read the page
  page <- read_html(url)
  Sys.sleep(1) # be polite while scraping!
  # get the parent rows
  tablebody <- html_elements(page, ".table.tdata tbody tr")
  # get the desired child nodes
  title  <- tablebody %>% html_element(".trackName.playIcon") %>% html_text2()
  artist <- tablebody %>% html_element(".tdata__td3") %>% html_text2()
  year   <- tablebody %>% html_element(".tdata__td3:nth-child(4)") %>% html_text2()
  genre  <- tablebody %>% html_element(".tdata__badge") %>% html_text2()
  data.frame(title, artist, year, genre)
})

# combine all of the data frames into one
answer <- bind_rows(dfs)
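One detail worth generalising: the 1:5 above hardcodes the page count. A small helper could derive it from the pagination links instead, under the same assumption the question makes (that the last ".page a" label is the page number); parse_last_page() is a name of my own:

```r
# Hypothetical helper: turn the text of the ".page a" pagination links into a
# page count, falling back to 1 when there are no links or the last label is
# not a number.
parse_last_page <- function(labels) {
  n <- suppressWarnings(as.integer(labels[length(labels)]))
  if (length(n) == 0 || is.na(n)) 1L else n
}

# usage (sketch):
# labels <- page %>% html_elements(".page a") %>% html_text2()
# urls <- paste0(baseURL, seq_len(parse_last_page(labels)))
```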