Xian Zhao

Reputation: 81

Web scraping in R with SelectorGadget

I ran the simple code below to scrape the number of employees from this Fortune 500 page. I used the Chrome extension SelectorGadget to find that the number I want matches the selector ".info__row--7f9lE:nth-child(13) .info__value--2AHH7"

library(rvest)
library(dplyr)
# download the Google Chrome extension SelectorGadget to find the selector
link = "https://fortune.com/company/walmart/"
page = read_html(link)
Employees = page %>% html_nodes(".info__row--7f9lE:nth-child(13) .info__value--2AHH7") %>% html_text()
Employees

However, it returns character(0). Does anyone know what the cause is? I feel it must be a simple mistake somewhere. Thanks in advance!

Update

Here is the code I modified based on Jon's comments.

a <- c(
  "https://fortune.com/company/walmart/",
  "https://fortune.com/company/amazon-com/",
  "https://fortune.com/company/apple/",
  "https://fortune.com/company/cvs-health/",
  "https://fortune.com/company/unitedhealth-group/",
  "https://fortune.com/company/berkshire-hathaway/",
  "https://fortune.com/company/mckesson/",
  "https://fortune.com/company/amerisourcebergen/",
  "https://fortune.com/company/alphabet/",
  "https://fortune.com/company/exxon-mobil/",
  "https://fortune.com/company/att/",
  "https://fortune.com/company/costco/",
  "https://fortune.com/company/cigna/",
  "https://fortune.com/company/cardinal-health/",
  "https://fortune.com/company/microsoft/",
  "https://fortune.com/company/walgreens-boots-alliance/",
  "https://fortune.com/company/kroger/",
  "https://fortune.com/company/home-depot/",
  "https://fortune.com/company/jpmorgan-chase/",
  "https://fortune.com/company/verizon/",
  "https://fortune.com/company/ford-motor/",
  "https://fortune.com/company/general-motors/",
  "https://fortune.com/company/anthem/",
  "https://fortune.com/company/centene/",
  "https://fortune.com/company/fannie-mae/",
  "https://fortune.com/company/comcast/",
  "https://fortune.com/company/chevron/",
  "https://fortune.com/company/dell-technologies/",
  "https://fortune.com/company/bank-of-america-corp/",
  "https://fortune.com/company/target/"
)


find_by_name <- function(list_data, name, elem = NULL) {
  idx <- which(sapply(list_data, \(x) x$name) == name, arr.ind = TRUE)
  stopifnot(length(idx) > 0)
  if (length(idx) > 1) { idx <- idx[1] }
  dat <- list_data[[idx]]
  if (is.null(elem)) dat else dat[[elem]]
}

numEmp <- numeric()

for (i in 1:length(a)){
  json_data <- read_html(a[i]) |>
    html_element("script#preload") |> 
    html_text() |>
    sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x = _, perl = TRUE) |>
    sub(";\\s*$", "", x = _, perl = TRUE) |>
    fromJSON(simplifyVector = FALSE)
  
  
  
  temp<-gsub(".*https://fortune.com", "", a[i])
  page_data <- json_data$components$page[[temp]]
  
  info_data <- page_data |> 
    find_by_name("body", "children") |>
    find_by_name("company-about-wrapper", "children") |>
    find_by_name("company-information", "config")
  
  
  numEmp[i] <- info_data$employees # Results will be fed into this numEmp variable.
}
numEmp

It stops with this error:

Error in find_by_name(page_data, "body", "children") : length(idx) > 0 is not TRUE

Should I somehow change the code stopifnot(length(idx) > 0)?

Upvotes: 1

Views: 160

Answers (1)

Jon Manese

Reputation: 371

Running document.querySelectorAll(".info__row--7f9lE:nth-child(13) .info__value--2AHH7") in the browser console shows that you want to scrape the number of employees. Maurits is right: the data appears to be delivered as inline JSON and rendered by JavaScript afterwards, so the HTML that read_html() downloads does not yet contain the nodes your selector targets. You can either use Selenium to save the rendered page and apply your CSS selector there, or extract the inline JSON and read the value from it.
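
To illustrate the second option on something small, the sketch below strips the JavaScript assignment around a made-up script body, leaving plain JSON text that a parser such as jsonlite::fromJSON() could then read. The script_text value is a hypothetical stand-in, not the real content of the Fortune page; only the window.__PRELOADED_STATE__ prefix is taken from the code further down. It uses base R only.

```r
# Hypothetical stand-in for the text inside <script id="preload">.
# The real script body is far larger; this only mimics its shape.
script_text <- '\n  window.__PRELOADED_STATE__ = {"components":{"page":{"/company/example/":[]}}};\n'

# Step 1: drop the leading "window.__PRELOADED_STATE__ = " assignment.
json_text <- sub("^\\s*window\\.__PRELOADED_STATE__ = ", "", script_text, perl = TRUE)

# Step 2: drop the trailing semicolon and any whitespace after it.
json_text <- sub(";\\s*$", "", json_text, perl = TRUE)

# What remains is plain JSON text, ready for a JSON parser.
print(json_text)
```

Both patterns are anchored to the ends of the string, so the JSON in the middle is left untouched.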

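After some manual work, you can do the second option as shown below. The commented-out lines use the x = _ placeholder, which requires R 4.2.x; the sub2() helper does the same job in R 4.1.x.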

library(rvest)
library(jsonlite)

# R 4.1.x
sub2 <- function(x, pattern, replacement) sub(pattern, replacement, x = x, perl = TRUE)

url <- "https://fortune.com/company/walmart/"
json_data <- read_html(url) |>
  html_element("script#preload") |> 
  html_text() |>
  ## sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x = _, perl = TRUE) |> # R 4.2.x
  sub2("\\s*window\\.__PRELOADED_STATE__ = ", "") |>                       # R 4.1.x
  ## sub(";\\s*$", "", x = _, perl = TRUE) |>  # R 4.2.x
  sub2(";\\s*$", "") |>                        # R 4.1.x
  fromJSON(simplifyVector = FALSE)

page_data <- json_data$components$page[["/company/walmart/"]]

find_by_name <- function(list_data, name, elem = NULL) {
  idx <- which(sapply(list_data, \(x) x$name) == name, arr.ind = TRUE)
  stopifnot(length(idx) > 0)
  if (length(idx) > 1) { idx <- idx[1] }
  dat <- list_data[[idx]]
  if (is.null(elem)) dat else dat[[elem]]
}

info_data <- page_data |> 
  find_by_name("body", "children") |>
  find_by_name("company-about-wrapper", "children") |>
  find_by_name("company-information", "config")

info_data$employees
#> [1] "2300000"

# Extra code to scrape company-data-table segments
library(purrr)
data_tables <- page_data |>
  find_by_name("body", "children") |>
  find_by_name("company-about-wrapper", "children") |>
  find_by_name("company-table-wrapper", "children")

rows <- data_tables |>
  lapply(\(x) c(x$config$data, x$config$change)) |>
  purrr::flatten() |>
  discard(~ is.null(.$key))

df <- data.frame(
  key = rows |> map_chr(~ .$key),
  title = rows |> map_chr(~ .$fieldMeta$title),
  type = rows |> map_chr(~ .$fieldMeta$type),
  value = rows |> map_chr(~ .$value)
)
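
Regarding the error in the question's update: it is raised by the first find_by_name() call, which means no element of page_data is named "body". That also happens when page_data is NULL, for example if the key built from the URL is not present in json_data$components$page for some company. Which of these applies to which URL has not been verified here. A minimal sketch, using only base R and made-up toy data, of a tolerant variant that yields NA for such pages instead of aborting the loop (find_by_name_safe and get_employees are hypothetical names, not part of any package):

```r
# Hypothetical tolerant variant of find_by_name(): returns NULL when the
# input is empty or no node has the requested name, instead of stopping.
find_by_name_safe <- function(list_data, name, elem = NULL) {
  if (is.null(list_data) || length(list_data) == 0) return(NULL)
  # Collect each node's name; nodes without a usable name become NA.
  node_names <- vapply(
    list_data,
    \(x) if (is.list(x) && is.character(x$name) && length(x$name) == 1) x$name else NA_character_,
    character(1)
  )
  idx <- which(node_names == name)
  if (length(idx) == 0) return(NULL)
  dat <- list_data[[idx[1]]]          # keep the first match, as in the original
  if (is.null(elem)) dat else dat[[elem]]
}

# Walk the same three levels as above; NA if any level is missing.
get_employees <- function(page_data) {
  info <- page_data |>
    find_by_name_safe("body", "children") |>
    find_by_name_safe("company-about-wrapper", "children") |>
    find_by_name_safe("company-information", "config")
  if (is.null(info$employees)) NA_real_ else as.numeric(info$employees)
}

# Toy data (assumption): one page with the expected nesting, one missing page.
page_ok <- list(
  list(name = "header", children = list()),
  list(name = "body", children = list(
    list(name = "company-about-wrapper", children = list(
      list(name = "company-information", config = list(employees = "123"))
    ))
  ))
)
page_missing <- NULL  # what list[["absent-key"]] returns

results <- vapply(list(page_ok, page_missing), get_employees, numeric(1))
print(results)
```

In the loop from the question, numEmp[i] <- get_employees(page_data) would then record NA for pages that do not match, and those URLs can be inspected afterwards to see how their JSON differs.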

Upvotes: 3
