ChinookJargon

Reputation: 99

Combining web scraping with a for loop

I have been scraping and combining multiple datasets, but I feel my approach could be more efficient, and I am currently stuck.
My original approach was to download each dataset individually and use rbind() to make one large dataset:

library(tidyverse)
library(rvest)

uber <- read_html("http://h1bdata.info/index.php?em=Uber&job=&city=&year=All+Years") %>%
  html_node("#myTable") %>%
  html_table()

airbnb <- read_html("http://h1bdata.info/index.php?em=Airbnb&job=&city=&year=All+Years") %>%
  html_node("#myTable") %>%
  html_table()

rbind(uber, airbnb)

However, over ten datasets this becomes tedious and inefficient, so I tried writing a loop. So far this has been my best attempt:

# I created a list of the tech companies for my loop index
tech.companies <- as.list(c("Airbnb", "Amazon", "Apple", "Facebook", "Google", "Linkedin", "Microsoft", "Twitter", "Uber", "Yahoo"))

# I then create the loop. This has been my "best" attempt
for(i in 1:length(tech.companies)) {
  url <- paste0("http://h1bdata.info/index.php?em=", i, "&job=&city=&year=All+Years")
  tble <- read_html(url) %>%
    html_node("#myTable") %>%
    html_table()
}

However, I am not really understanding what my loop is doing. What is the output? How do I store the results of each loop in a new variable? Is it possible to use rbind() to combine the datasets within the loop itself? I read a lot that apply functions are better to use in R than loops, is this still the case in a situation like this?

Any insight you could share would be greatly appreciated.

Upvotes: 0

Views: 259

Answers (1)

TomS

Reputation: 226

Generally, I have the impression loops are avoided in R where possible. In this case, however, I prefer to use a loop, though the reason might be that I started coding in Matlab (where loops are much more common). If you insisted on apply functions (I cannot provide performance benchmarks), you could predefine beforehand, in a vector or list, everything you would otherwise generate in each loop iteration, e.g. all the URLs, write a scraping function, and then run lapply(urls, myfun) or sapply(urls, myfun).
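To illustrate, here is a minimal sketch of that lapply() approach, reusing the rvest workflow from your question (the actual network call is left commented out so you can run it when you are online; scrape_table is just an illustrative name):

```r
# library(tidyverse); library(rvest)  # uncomment for the actual scraping

tech.companies <- c("Airbnb", "Amazon", "Apple", "Facebook", "Google",
                    "Linkedin", "Microsoft", "Twitter", "Uber", "Yahoo")

# Build every URL up front: paste0() is vectorised over tech.companies
urls <- paste0("http://h1bdata.info/index.php?em=",
               tech.companies, "&job=&city=&year=All+Years")

# One small scraping function (its body only runs when it is called)
scrape_table <- function(u) {
  read_html(u) %>%
    html_node("#myTable") %>%
    html_table()
}

# lapply() returns a list of data frames, one per company:
# result <- lapply(urls, scrape_table)
```

The predefined urls vector replaces the paste0() call inside the loop, and lapply() does the iterating and collecting for you.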

To store the results of a loop, you can either use a list, or rbind() a temporary variable from each iteration onto one "final" variable, iteration by iteration.

In your case I'd go for a list, as you obviously want one table per employer. Rbinding iteration by iteration (e.g. if you were scraping article prices item by item) is not necessary here, I think (open for discussion).

The loop itself works a bit differently from what you seem to expect.

The iteration variable i is only a number, not the character value from your vector. "i in 1:length(x)" iterates over the vector 1:10 (10 employers), meaning that the URL you create in the loop with url <- paste0() does not contain the company name, just the iteration number (1 to 10). In the code below you can see how to get the correct URL using tech.companies[i], which retrieves the i-th element of the vector tech.companies.

tech.companies <- c("Airbnb", "Amazon", "Apple", "Facebook", "Google", "Linkedin", "Microsoft", "Twitter", "Uber", "Yahoo")
result <- list() # init list (so R knows you store tble in a list)

for(i in 1:length(tech.companies)) {
   url <- paste0("http://h1bdata.info/index.php?em=", tech.companies[i], "&job=&city=&year=All+Years")
   tble <- read_html(url) %>%
      html_node("#myTable") %>%
      html_table()

   result[[i]] <- tble 
}

You can then access the result data.frames e.g. with the following code:

Airbnb <- result[[1]] # each element of result is a data.frame
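And to answer your rbind() question: once everything is in the list, you can name its elements and stack them into one large dataset in a single step. A toy sketch with stand-in data frames (do.call() here calls rbind with all list elements as arguments):

```r
# Stand-ins for two scraped tables
result <- list(data.frame(salary = 100), data.frame(salary = 200))
names(result) <- c("Airbnb", "Amazon")  # name elements after the companies

Airbnb   <- result[["Airbnb"]]          # pick one employer's table by name
combined <- do.call(rbind, result)      # stack all tables into one data frame
```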

Upvotes: 1
