user113156
user113156

Reputation: 7127

Using dplyr to collect data and bind rows of the data collected

I am trying to use rvest() to extract some information. What I have is a list of links and I would like to bind the rows of the data collected together.

What I currently have is the following;

EDIT: heres the links without the weekend data

    links <- c("https://finance.yahoo.com/calendar/ipo?day=2018-03-05", "https://finance.yahoo.com/calendar/ipo?day=2018-03-06", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-07", "https://finance.yahoo.com/calendar/ipo?day=2018-03-08", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-09", "https://finance.yahoo.com/calendar/ipo?day=2018-03-12", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-13", "https://finance.yahoo.com/calendar/ipo?day=2018-03-14", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-15", "https://finance.yahoo.com/calendar/ipo?day=2018-03-16", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-19", "https://finance.yahoo.com/calendar/ipo?day=2018-03-20", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-21", "https://finance.yahoo.com/calendar/ipo?day=2018-03-22", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-23", "https://finance.yahoo.com/calendar/ipo?day=2018-03-26", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-27", "https://finance.yahoo.com/calendar/ipo?day=2018-03-28", 
"https://finance.yahoo.com/calendar/ipo?day=2018-03-29", "https://finance.yahoo.com/calendar/ipo?day=2018-03-30", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-02", "https://finance.yahoo.com/calendar/ipo?day=2018-04-03", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-04", "https://finance.yahoo.com/calendar/ipo?day=2018-04-05", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-06", "https://finance.yahoo.com/calendar/ipo?day=2018-04-09", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-10", "https://finance.yahoo.com/calendar/ipo?day=2018-04-11", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-12", "https://finance.yahoo.com/calendar/ipo?day=2018-04-13", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-16", "https://finance.yahoo.com/calendar/ipo?day=2018-04-17", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-18", "https://finance.yahoo.com/calendar/ipo?day=2018-04-19", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-20", "https://finance.yahoo.com/calendar/ipo?day=2018-04-23", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-24", "https://finance.yahoo.com/calendar/ipo?day=2018-04-25", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-26", "https://finance.yahoo.com/calendar/ipo?day=2018-04-27", 
"https://finance.yahoo.com/calendar/ipo?day=2018-04-30", "https://finance.yahoo.com/calendar/ipo?day=2018-05-01", 
"https://finance.yahoo.com/calendar/ipo?day=2018-05-02", "https://finance.yahoo.com/calendar/ipo?day=2018-05-03", 
"https://finance.yahoo.com/calendar/ipo?day=2018-05-04", "https://finance.yahoo.com/calendar/ipo?day=2018-05-07", 
"https://finance.yahoo.com/calendar/ipo?day=2018-05-08", "https://finance.yahoo.com/calendar/ipo?day=2018-05-09", 
"https://finance.yahoo.com/calendar/ipo?day=2018-05-10")

Code:

library(rvest)
library(dplyr)
library(magrittr)
x <- links %>% 
  read_html() %>% 
  html_table() %>%
  extract2(1) %>%
  bind_rows() %>%
  as_tibble

This gives the following error: Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Expecting a single string value: [type=character; extent=68].

I am able to get the code working for 1 link however when I try to get it working for all the links I am running into errors. For example this code works:

x <- "https://finance.yahoo.com/calendar/ipo?day=2018-05-08" %>% 
  read_html() %>% 
  html_table() %>%
  extract2(1) %>%
  bind_rows() %>%
  as_tibble

EDIT:

from = "2016-03-04"
to = "2018-05-10"


s <- seq(as.Date(from), as.Date(to), "days")
library(chron)
s <- s[!is.weekend(s)]  
links <- paste0("https://finance.yahoo.com/calendar/ipo?day=", s)



library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
library(naniar)

IPOs <- links[1:400] %>% 
  map_dfr(~read_html(.x) %>% 
            html_table() %>%
            extract2(1) %>%
            naniar::replace_with_na_all(condition = ~.x == "-") %>%
            type.convert(as.is = TRUE) )

Upvotes: 0

Views: 324

Answers (1)

aosmith
aosmith

Reputation: 36084

It looks like you want to loop through the URL's. For each one you want to read it, parse it into a data frame, and extracting the first data frame in the list. So the read_html() through extract2() steps should be done within the loop.

One option is to use a purrr::map_dfr() loop, since it looks like you want to bind things into a single tibble in the end.

Nominally that could look like:

library(rvest)
library(dplyr)
library(magrittr)
library(purrr)

links %>% 
     map_dfr(~read_html(.x) %>% 
              html_table() %>%
              extract2(1) )

However, it turns out that you have missing values that are represented by hyphens (-). Some of the tables have these and some don't. When these are present, R reads your integer columns as characters while when they are not present integers are read as integer columns. This causes problems when binding everything together.

I did not see an argument in read_html() to deal with these directly (I was looking for the equivalent of na.strings in read.table() or na in readr::read_csv()). My work-around was to convert the hyphens to NA using function replace_with_na_all() from package naniar (see the vignette here). Then I converted all columns to the appropriate type with type.convert().

All of this was done within the map_dfr() loop.

Here is an example with just the first two URL's in links.

links[1:2] %>% 
     map_dfr(~read_html(.x) %>% 
              html_table() %>%
              extract2(1) %>%
              naniar::replace_with_na_all(condition = ~.x == "-") %>%
              type.convert(as.is = TRUE) )

# A tibble: 15 x 9
   Symbol    Company                     Exchange  Date         `Price Range` Price Currency    Shares Actions 
   <chr>     <chr>                       <chr>     <chr>        <chr>         <dbl> <chr>        <int> <chr>   
 1 2003.HK   Vcredit Hldg Ltd            HKSE      Jun 21, 2018 NA            20    HKD       49969000 Priced  
 2 2003.HK   Vcredit Hldg Ltd            HKSE      Jun 21, 2018 NA            20    HKD       11745600 Priced  
 3 2003.HK   Vcredit Hldg Ltd            HKSE      Jun 21, 2018 NA            20    HKD        6857200 Priced  
 4 0000      Vcredit Hldg Ltd            HKSE      Jun 12, 2018 NA            NA    HKD             NA Expected
 5 6571.JP   QB Net Holdings Co Ltd      Japan OTC Mar 14, 2018 21.11 - 21.11 NA    Y          9785900 Expected
 6 1621.HK   Vico Intl Hldg Ltd          HKSE      Mar 05, 2018 NA             0.35 HKD      175000000 Priced  
 7 PZM.AX    Piston Mach Ltd             ASX       Mar 05, 2018 0.32 - 0.32   NA    AU        50000000 Expected
 8 ""        Agp Ltd                     Karachi   Mar 05, 2018 0.76 - 0.76   80    PKR        8750000 Priced  
 9 GRC.L     GRC International Group PLC LSE       Mar 05, 2018 0.98 - 0.98    0.7  GBP        8414286 Priced  
10 ACPH.BR   Acacia Pharma Group PLC     Brussels  Mar 05, 2018 3.24 - 4.16    3.6  EUR        3175413 Priced  
11 ACPH.BR   Acacia Pharma Group PLC     Brussels  Mar 05, 2018 3.24 - 4.16    3.6  EUR        7935698 Priced  
12 GCI.AX    Gryphon Capital Income Tr   ASX       May 23, 2018 1.57 - 1.57    2    AUD       87650000 Priced  
13 GCI.AX    Gryphon Capital Income Tr   ASX       May 04, 2018 1.57 - 1.57   NA    AUD       50000000 Expected
14 STRL.L    Stirling Inds Plc           LSE       Mar 06, 2018 1.40 - 1.40    1    GBP        8881002 Priced  
15 541006.BO Angel Fibers Ltd            BSE       Mar 06, 2018 NA            27    INR        6408000 Priced  

Upvotes: 1

Related Questions