Using dplyr to collect data and bind rows of the data collected

Question

I am trying to use rvest() to extract some information. What I have is a list of links and I would like to bind the rows of the data collected together.

What I currently have is the following;

EDIT: heres the links without the weekend data

    links <- c("https://finance.yahoo.com/calendar/ipo?day=2018-03-05", "https://finance.yahoo.com/calendar/ipo?day=2018-03-06", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-07", "https://finance.yahoo.com/calendar/ipo?day=2018-03-08", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-09", "https://finance.yahoo.com/calendar/ipo?day=2018-03-12", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-13", "https://finance.yahoo.com/calendar/ipo?day=2018-03-14", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-15", "https://finance.yahoo.com/calendar/ipo?day=2018-03-16", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-19", "https://finance.yahoo.com/calendar/ipo?day=2018-03-20", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-21", "https://finance.yahoo.com/calendar/ipo?day=2018-03-22", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-23", "https://finance.yahoo.com/calendar/ipo?day=2018-03-26", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-27", "https://finance.yahoo.com/calendar/ipo?day=2018-03-28", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-29", "https://finance.yahoo.com/calendar/ipo?day=2018-03-30", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-02", "https://finance.yahoo.com/calendar/ipo?day=2018-04-03", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-04", "https://finance.yahoo.com/calendar/ipo?day=2018-04-05", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-06", "https://finance.yahoo.com/calendar/ipo?day=2018-04-09", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-10", "https://finance.yahoo.com/calendar/ipo?day=2018-04-11", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-12", "https://finance.yahoo.com/calendar/ipo?day=2018-04-13", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-16", "https://finance.yahoo.com/calendar/ipo?day=2018-04-17", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-18", "https://finance.yahoo.com/calendar/ipo?day=2018-04-19", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-20", "https://finance.yahoo.com/calendar/ipo?day=2018-04-23", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-24", "https://finance.yahoo.com/calendar/ipo?day=2018-04-25", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-26", "https://finance.yahoo.com/calendar/ipo?day=2018-04-27", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-30", "https://finance.yahoo.com/calendar/ipo?day=2018-05-01", 
"https://finance.yahoo.com/calendar/ipo?day=2018-05-02", "https://finance.yahoo.com/calendar/ipo?day=2018-05-03", 
"https://finance.yahoo.com/calendar/ipo?day=2018-05-04", "https://finance.yahoo.com/calendar/ipo?day=2018-05-07", 
"https://finance.yahoo.com/calendar/ipo?day=2018-05-08", "https://finance.yahoo.com/calendar/ipo?day=2018-05-09", 
"https://finance.yahoo.com/calendar/ipo?day=2018-05-10")

Code:

library(rvest)
library(dplyr)
library(magrittr)
x <- links %>% 
  read_html() %>% 
  html_table() %>%
  extract2(1) %>%
  bind_rows() %>%
  as_tibble

This gives the following error: Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Expecting a single string value: [type=character; extent=68].

I am able to get the code working for 1 link however when I try to get it working for all the links I am running into errors. For example this code works:

x <- "https://finance.yahoo.com/calendar/ipo?day=2018-05-08" %>% 
  read_html() %>% 
  html_table() %>%
  extract2(1) %>%
  bind_rows() %>%
  as_tibble

EDIT:

from = "2016-03-04"
to = "2018-05-10"


s <- seq(as.Date(from), as.Date(to), "days")
library(chron)
s <- s[!is.weekend(s)]  
links <- paste0("https://finance.yahoo.com/calendar/ipo?day=", s)



library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
library(naniar)

IPOs <- links[1:400] %>% 
  map_dfr(~read_html(.x) %>% 
            html_table() %>%
            extract2(1) %>%
            naniar::replace_with_na_all(condition = ~.x == "-") %>%
            type.convert(as.is = TRUE) )

aosmith · Accepted Answer

It looks like you want to loop through the URL's. For each one you want to read it, parse it into a data frame, and extracting the first data frame in the list. So the read_html() through extract2() steps should be done within the loop.

One option is to use a purrr::map_dfr() loop, since it looks like you want to bind things into a single tibble in the end.

Nominally that could look like:

library(rvest)
library(dplyr)
library(magrittr)
library(purrr)

links %>% 
     map_dfr(~read_html(.x) %>% 
              html_table() %>%
              extract2(1) )

However, it turns out that you have missing values that are represented by hyphens (-). Some of the tables have these and some don't. When these are present, R reads your integer columns as characters while when they are not present integers are read as integer columns. This causes problems when binding everything together.

I did not see an argument in read_html() to deal with these directly (I was looking for the equivalent of na.strings in read.table() or na in readr::read_csv()). My work-around was to convert the hyphens to NA using function replace_with_na_all() from package naniar (see the vignette here). Then I converted all columns to the appropriate type with type.convert().

All of this was done within the map_dfr() loop.

Here is an example with just the first two URL's in links.

links[1:2] %>% 
     map_dfr(~read_html(.x) %>% 
              html_table() %>%
              extract2(1) %>%
              naniar::replace_with_na_all(condition = ~.x == "-") %>%
              type.convert(as.is = TRUE) )

# A tibble: 15 x 9
   Symbol    Company                     Exchange  Date         `Price Range` Price Currency    Shares Actions 
                                                                  
 1 2003.HK   Vcredit Hldg Ltd            HKSE      Jun 21, 2018 NA            20    HKD       49969000 Priced  
 2 2003.HK   Vcredit Hldg Ltd            HKSE      Jun 21, 2018 NA            20    HKD       11745600 Priced  
 3 2003.HK   Vcredit Hldg Ltd            HKSE      Jun 21, 2018 NA            20    HKD        6857200 Priced  
 4 0000      Vcredit Hldg Ltd            HKSE      Jun 12, 2018 NA            NA    HKD             NA Expected
 5 6571.JP   QB Net Holdings Co Ltd      Japan OTC Mar 14, 2018 21.11 - 21.11 NA    Y          9785900 Expected
 6 1621.HK   Vico Intl Hldg Ltd          HKSE      Mar 05, 2018 NA             0.35 HKD      175000000 Priced  
 7 PZM.AX    Piston Mach Ltd             ASX       Mar 05, 2018 0.32 - 0.32   NA    AU        50000000 Expected
 8 ""        Agp Ltd                     Karachi   Mar 05, 2018 0.76 - 0.76   80    PKR        8750000 Priced  
 9 GRC.L     GRC International Group PLC LSE       Mar 05, 2018 0.98 - 0.98    0.7  GBP        8414286 Priced  
10 ACPH.BR   Acacia Pharma Group PLC     Brussels  Mar 05, 2018 3.24 - 4.16    3.6  EUR        3175413 Priced  
11 ACPH.BR   Acacia Pharma Group PLC     Brussels  Mar 05, 2018 3.24 - 4.16    3.6  EUR        7935698 Priced  
12 GCI.AX    Gryphon Capital Income Tr   ASX       May 23, 2018 1.57 - 1.57    2    AUD       87650000 Priced  
13 GCI.AX    Gryphon Capital Income Tr   ASX       May 04, 2018 1.57 - 1.57   NA    AUD       50000000 Expected
14 STRL.L    Stirling Inds Plc           LSE       Mar 06, 2018 1.40 - 1.40    1    GBP        8881002 Priced  
15 541006.BO Angel Fibers Ltd            BSE       Mar 06, 2018 NA            27    INR        6408000 Priced

Using dplyr to collect data and bind rows of the data collected

Answers (1)

Related Questions