Reputation: 25
I scraped a website already and made a dataframe from it that only contains one column. The dataframe is called "urldataframe", while the column that contains all of the urls is called "individualrace_url".
Here is some of my data. The links are currently stored as character strings in the dataframe urldataframe.
Here are the first two links:
https://sites.google.com/site/hscrrarchive/arca-1/2007-arca-re-max-series-season/2007-arca-200
https://sites.google.com/site/hscrrarchive/arca-1/2007-arca-re-max-series-season/2007-construct-corps-palm-beach-grading-250
How do I create a scraper that goes through my dataframe of links one by one? I'm not sure if a for loop is the way to go about this or not. If I can use a for loop, what am I doing wrong?
res_all <- NULL
for (realtest.Event in racename) {
urlrunning = paste0(realtest.Event)
scrapinghere = read_html(urlrunning)
putithere <- tibble(
bubba = scrapinghere %>% html_nodes("#sites-canvas-main-content td:nth-child(2)") %>% html_text(),
bubba2 = scrapinghere %>% html_node("#sites-canvas-main-content td:nth-child(1)")) %>% html_text()
res_all <- bind_rows(res_all, putithere)
}
I was hoping the loop would go through each URL in the dataframe. Every URL has the same nodes, so I'm pretty sure my issue is with how I set up the loop itself.
Upvotes: 1
Views: 298
Reputation: 6563
You can scrape the tables from the two links without a loop.
library(tidyverse)
library(rvest)
library(janitor)
df <- tibble(
links = c("https://sites.google.com/site/hscrrarchive/arca-1/2007-arca-re-max-series-season/2007-arca-200",
"https://sites.google.com/site/hscrrarchive/arca-1/2007-arca-re-max-series-season/2007-construct-corps-palm-beach-grading-250")
)
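# Read one page and return its results table as a tibble
get_ARCA <- function(link) {
  link %>%
    read_html() %>%
    html_table() %>%    # every <table> on the page, as a list
    pluck(4) %>%        # keep the 4th table
    row_to_names(1) %>% # promote the first row to column names
    clean_names()       # convert the names to snake_case
}
# Apply the function to every link and row-bind the results
map_dfr(df$links, get_ARCA)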
# A tibble: 81 × 9
   finish start car_number driver            sponsor                                make      laps  led   status
   <chr>  <chr> <chr>      <chr>             <chr>                                  <chr>     <chr> <chr> <chr>
 1 1      2     5          Bobby Gerhart     Lucas Oil                              Chevrolet 80    54    Running
 2 2      20    93         Marc Mitchell     Ergon                                  Pontiac   80    5     Running
 3 3      12    3          Jeremy Clements   Harrison's Work Wear-1 Stop Conv-Saxon Chevrolet 80    0     Running
 4 4      13    39         David Ragan       AAA                                    Ford      80    0     Running
 5 5      3     46         Frank Kimmel      Tri-State Motorsports-Pork             Ford      80    0     Running
 6 6      19    31         Timothy Peters    Cometic Gaskets-Okuma                  Chevrolet 80    0     Running
 7 7      31    16         Justin Allgaier   AG Tech-Trashman-Hoosier Tire Midwest  Chevrolet 80    0     Running
 8 8      14    4          Scott Lagasse Jr. Cunningham Motorsports                 Dodge     80    0     Running
 9 9      11    47         Phillip McGilton  SI Performance-Gould's Electric        Chevrolet 80    0     Running
10 10     17    2          Michael McDowell  Hillcrest Capital Partners             Dodge     80    0     Running
# … with 71 more rows
# ℹ Use `print(n = ...)` to see more rows
Other variations:
# 1. Keep each table as a separate element of a list
res_separate <- map(df$links, get_ARCA)
# 2. Store each table in a list-column of the original data frame
res_in_df <- df %>%
  mutate(content = map(links, get_ARCA))
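The same pattern can be pointed at the column named in the question. The following is only a sketch: `write_race_page()` and `get_first_table()` are hypothetical helpers, and the two temporary HTML files stand in for the real race pages so that the example runs without network access. With the real data, `urldataframe$individualrace_url` would be passed to `get_ARCA` in the same way.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical stand-in for a real race page: a temp file with one small table.
write_race_page <- function(drivers) {
  path <- tempfile(fileext = ".html")
  rows <- paste0("<tr><td>", seq_along(drivers), "</td><td>", drivers, "</td></tr>",
                 collapse = "")
  writeLines(paste0("<html><body><table>",
                    "<tr><th>finish</th><th>driver</th></tr>",
                    rows, "</table></body></html>"), path)
  path
}

# Same shape as the asker's data: one character column of links.
urldataframe <- tibble(
  individualrace_url = c(
    write_race_page(c("Driver A", "Driver B")),
    write_race_page(c("Driver C", "Driver D", "Driver E"))
  )
)

# Simplified scraper (assumption: the table of interest is the first one).
get_first_table <- function(link) {
  link %>%
    read_html() %>%
    html_table() %>%
    pluck(1)
}

# Pass the column itself, not the data frame, to map_dfr().
res_all <- map_dfr(urldataframe$individualrace_url, get_first_table)
```

The point of the sketch is that the loop variable comes from the column vector, so no explicit for loop or index is needed.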
Upvotes: 2
Reputation: 10365
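A for loop is OK. I think in your case the closing parenthesis of the tibble() call is in the wrong place: it comes before the final %>% html_text(), so html_text() is applied to the whole tibble instead of to bubba2. Another pattern I like is purrr::map_dfr, which returns a data.frame. Here is my untested code, as no data is provided: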
library(purrr)
library(rvest)   # read_html(), html_nodes(), html_text()
library(tibble)  # tibble()
res_all <- set_names(racename) %>%
  map_dfr(function(realtest.Event) {
    scrapinghere <- read_html(realtest.Event)
    tibble(
      bubba = scrapinghere %>% html_nodes("#sites-canvas-main-content td:nth-child(2)") %>% html_text(),
      bubba2 = scrapinghere %>% html_node("#sites-canvas-main-content td:nth-child(1)") %>% html_text()
    )
  }, .id = "racename")
I've used the .id argument to add a column to the returned data.frame that holds the value of realtest.Event, so that you know which URL each row of the results belongs to.
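For comparison, here is a sketch of the original for loop with only the parenthesis moved. It is not tested against the real site: `make_page()` is a hypothetical helper that writes small local HTML files mimicking the `#sites-canvas-main-content` structure, and `racename` is assumed to be a character vector of links (here, local file paths).

```r
library(rvest)
library(tibble)
library(dplyr)

# Hypothetical stand-in pages so the loop runs without network access.
make_page <- function(drivers) {
  path <- tempfile(fileext = ".html")
  rows <- paste0("<tr><td>", seq_along(drivers), "</td><td>", drivers, "</td></tr>",
                 collapse = "")
  writeLines(paste0("<html><body><div id='sites-canvas-main-content'><table>",
                    rows, "</table></div></body></html>"), path)
  path
}

# Assumption: racename is a character vector of links.
racename <- c(
  make_page(c("Driver A", "Driver B")),
  make_page(c("Driver C", "Driver D", "Driver E"))
)

res_all <- NULL
for (realtest.Event in racename) {
  scrapinghere <- read_html(realtest.Event)
  putithere <- tibble(
    bubba = scrapinghere %>%
      html_nodes("#sites-canvas-main-content td:nth-child(2)") %>%
      html_text(),
    # html_node() (singular) returns only the first match; tibble() recycles
    # that single value across all rows of the page.
    bubba2 = scrapinghere %>%
      html_node("#sites-canvas-main-content td:nth-child(1)") %>%
      html_text()
  ) # tibble() now closes after both html_text() calls
  res_all <- bind_rows(res_all, putithere)
}
```

If every first-column cell is wanted rather than just the first one, `html_nodes()` would presumably be the call to use for `bubba2` as well.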
Upvotes: 1