R_Beginner98

Reputation: 35

Web Scraping with Rvest: set missing entries to NA

I'm an absolute R beginner and I've been trying to scrape shoe prices from this Sprinter Sports page. My ultimate goal is a dataset that is refreshed automatically every day with the (i) original and (ii) discounted prices of the shoes I'm interested in.

The problem is that, of the 24 shoes currently for sale, only 16 have both an "original" and "discounted" price. The remaining 8 don't have a "discounted" price as they are not being sold at a discount. Since the "original" column has 24 observations, and the "discounted" column only has 16, I can't join these together in a dataset.

How can I load shoes without a discount such that their "discounted" column is set to NA? My code is below. Thanks!

library(rvest)
library(tibble)

date_today = substring(gsub("-", "", Sys.Date()),3)

page_sp_merrel <- read_html("https://www.sprintersports.com/pt/sapatilhas-merrell-homem?page=1&per_page=50")

# Each selector is matched across the whole page, so a shoe without that
# element contributes nothing and the vectors can end up with different lengths
price_old_sp_merrel <- page_sp_merrel %>%
  html_nodes(".product-card__info-price-old") %>%
  html_text()

price_new_sp_merrel <- page_sp_merrel %>%
  html_nodes(".product-card__info-price-actual") %>%
  html_text()

product_name_sp_merrel <- page_sp_merrel %>%
  html_nodes(".col-md-3 .product-card__info-name") %>%
  html_text()

# This is the step that fails when the columns differ in length
sp_merrel_df <- tibble(
  price_old = price_old_sp_merrel,
  price_new = price_new_sp_merrel,
  product_name = product_name_sp_merrel,
  date = date_today
)

Upvotes: 2

Views: 62

Answers (1)

stefan

Reputation: 124213

One way to do this is to loop over the product cards instead of selecting each field across the whole page. Within a single card, html_node() returns a missing node when the selector matches nothing, and html_text() turns that into NA, so every card contributes exactly one row to the data frame:

library(rvest)

# Today's date as yymmdd, e.g. "210719"
date_today <- format(Sys.Date(), "%y%m%d")

page_sp_merrel <- read_html("https://www.sprintersports.com/pt/sapatilhas-merrell-homem?page=1&per_page=50")

sp_merrel_df <- page_sp_merrel %>%
  # One node per product card
  html_nodes(".product-card__info-data") %>%
  # Build a one-row data frame per card and bind the rows together
  purrr::map_df(function(x) {
    data.frame(
      product_name = html_node(x, ".product-card__info-name") %>% html_text(),
      # NA when the card has no old price
      price_old = html_node(x, ".product-card__info-price-old") %>% html_text(),
      price_new = html_node(x, ".product-card__info-price-actual") %>% html_text(),
      date = date_today
    )
  })

head(sp_merrel_df)
#>                  product_name price_old price_new   date
#> 1          Merrell Riverbed 3   69,99 €   59,99 € 210719
#> 2 Sapatilhas Montanha Merrell      <NA>  114,99 € 210719
#> 3      Merrell Moab Adventure      <NA>   99,99 € 210719
#> 4          Merrel Moab 2 Vent   99,99 €   79,99 € 210719
#> 5          Merrell Alverstone      <NA>   79,99 € 210719
#> 6           Merrell Chameleon      <NA>  129,99 € 210719
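
The NA behaviour can be checked offline. The following is a minimal sketch on a hand-written two-card fragment: the HTML, the product names and the prices are made up for illustration, and only the class names come from the selectors above. It also shows a vectorised variant, assuming html_node() is applied to the whole set of cards at once, which returns one result per card without an explicit loop:

```r
library(rvest)

# Hypothetical fragment: two cards, the second one has no old price
html <- '<html><body>
  <div class="product-card__info-data">
    <p class="product-card__info-name">Shoe A</p>
    <span class="product-card__info-price-old">69,99</span>
    <span class="product-card__info-price-actual">59,99</span>
  </div>
  <div class="product-card__info-data">
    <p class="product-card__info-name">Shoe B</p>
    <span class="product-card__info-price-actual">99,99</span>
  </div>
</body></html>'

page <- read_html(html)

# Page-wide selection: cards without a match are silently skipped
old_all <- page %>%
  html_nodes(".product-card__info-price-old") %>%
  html_text()
length(old_all)  # 1, although there are 2 cards

# Per-card selection: html_node() keeps one slot per card
cards <- page %>% html_nodes(".product-card__info-data")

cards_df <- data.frame(
  product_name = cards %>% html_node(".product-card__info-name") %>% html_text(),
  price_old    = cards %>% html_node(".product-card__info-price-old") %>% html_text(),
  price_new    = cards %>% html_node(".product-card__info-price-actual") %>% html_text()
)

is.na(cards_df$price_old)  # FALSE TRUE
```

The key difference is html_nodes() (all matches, possibly none) versus html_node() (exactly one result per input, missing if there is no match). Whether the real page still uses these class names is something to verify against the live HTML.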

Upvotes: 1
