Reputation: 1453
I want to scrape a table from Yahoo Finance and download it as a dataframe.
Unfortunately, I don't really know how to do it using the rvest package.
Here is a first approach:
library(tidyverse)
library(rvest)
url<-"https://finance.yahoo.com/calendar/ipo?from=2021-02-21&to=2021-02-27&day=2021-02-23"
url %>%
html() %>%
html_nodes(xpath="table") %>%
html_table()
As expected, the code does not work. Can someone help me?
I want to have the framed table as a dataframe:
Many thanks in advance!
Upvotes: 0
Views: 377
Reputation: 119
Here is the simplest way of solving your problem and it keeps the headers too :)
library(tidyverse)
library(rvest)
url<-"https://finance.yahoo.com/calendar/ipo?from=2021-02-21&to=2021-02-27&day=2021-02-23"
# Scrape the data
df <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="cal-res-table"]') %>%
  as.character() %>%
  XML::readHTMLTable()
# df is a list of two tables (as you can see from the website) - pick only the first list item
tbl <- as.data.frame(df[1])
# print your table
tbl
#> NULL..Symbol. NULL.Company NULL.Exchange
#> 1 VELOU Velocity Acquisition Corp. Units Nasdaq
#> 2 FTAAU FTAC Athena Acquisition Corp. Unit Nasdaq
#> 3 CMIIU CM Life Sciences II Inc. Unit Nasdaq
#> 4 Metropress Ltd LSE
#> 5 CTWO.P.V County Capital 2 Ltd TSXV
#> 6 GSEVU Gores Holdings VII, Inc. Units Nasdaq
#> 7 NVOS Novo Integrated Sciences, Inc. Common Stock Nasdaq
#> 8 SLAMU Slam Corp. Unit Nasdaq
#> NULL.Date NULL.Price.Range NULL.Price NULL.Currency NULL.Shares
#> 1 Feb 23, 2021 10.00 - 10.00 - USD -
#> 2 Feb 23, 2021 - - USD -
#> 3 Feb 23, 2021 10.00 - 10.00 - USD -
#> 4 Feb 01, 2021 - 6 GBP 45452752
#> 5 Nov 19, 2020 0.08 - 0.08 0.1 CAD 6000000
#> 6 Feb 23, 2021 - - USD -
#> 7 Feb 23, 2021 - - USD -
#> 8 Feb 23, 2021 10.00 - 10.00 - USD -
#> NULL.Actions
#> 1 Expected
#> 2 Expected
#> 3 Expected
#> 4 Priced
#> 5 Priced
#> 6 Expected
#> 7 Expected
#> 8 Expected
You might want to clean up those column names, though. :)
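As a rough sketch of that cleanup, the names can be fixed with two regular-expression substitutions. The vector below is copied from the printed output above so the snippet runs without hitting the website; `clean_names` is a hypothetical helper name, not part of rvest or XML.

```r
# Column names copied from the printed output above
raw_names <- c("NULL..Symbol.", "NULL.Company", "NULL.Exchange", "NULL.Date",
               "NULL.Price.Range", "NULL.Price", "NULL.Currency",
               "NULL.Shares", "NULL.Actions")

# Hypothetical helper: strip the "NULL." prefix and any trailing dots
clean_names <- function(x) {
  x <- sub("^NULL\\.+", "", x)  # drop the leading "NULL." (one or more dots)
  x <- sub("\\.+$", "", x)      # drop dots left at the end of a name
  x
}

cleaned <- clean_names(raw_names)
print(cleaned)
```

On the scraped table this would be applied as `names(tbl) <- clean_names(names(tbl))`. Names with an inner dot, such as `Price.Range`, are left as they are.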
Upvotes: 1
Reputation: 388982
Unfortunately, the table is not easily extractable using html_table. Here's a way to extract the individual values from the table and do some post-processing to get the data into a dataframe.
library(rvest)
url<-"https://finance.yahoo.com/calendar/ipo?from=2021-02-21&to=2021-02-27&day=2021-02-23"
tab1 <- url %>%
  read_html() %>%
  html_nodes('table') %>%
  .[[1]]
header <- tab1 %>% html_nodes('th') %>% html_text()
result <- tab1 %>%
  html_nodes('tr.simpTblRow td') %>%
  html_text() %>%
  matrix(ncol = 9, byrow = TRUE) %>%
  as.data.frame()
names(result) <- header
result
# Symbol Company Exchange
#1 VELOU Velocity Acquisition Corp. Units Nasdaq
#2 FTAAU FTAC Athena Acquisition Corp. Unit Nasdaq
#3 CMIIU CM Life Sciences II Inc. Unit Nasdaq
#4 Metropress Ltd LSE
#5 CTWO.P.V County Capital 2 Ltd TSXV
#6 GSEVU Gores Holdings VII, Inc. Units Nasdaq
#7 NVOS Novo Integrated Sciences, Inc. Common Stock Nasdaq
#8 SLAMU Slam Corp. Unit Nasdaq
# Date Price Range Price Currency Shares Actions
#1 Feb 23, 2021 10.00 - 10.00 - USD - Expected
#2 Feb 23, 2021 - - USD - Expected
#3 Feb 23, 2021 10.00 - 10.00 - USD - Expected
#4 Feb 01, 2021 - 6 GBP 45452752 Priced
#5 Nov 19, 2020 0.08 - 0.08 0.1 CAD 6000000 Priced
#6 Feb 23, 2021 - - USD - Expected
#7 Feb 23, 2021 - - USD - Expected
#8 Feb 23, 2021 10.00 - 10.00 - USD - Expected
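The reshaping step can be tried offline. Below is a minimal sketch in base R that uses a made-up three-column subset of the values shown above (`header` and `cells` are illustrative stand-ins for the scraped vectors). It also takes the column count from the header instead of hard-coding it, and assumes you want "-" treated as missing, which the original answer does not do.

```r
# Illustrative stand-ins for the scraped header and cell text
header <- c("Symbol", "Price", "Currency")
cells  <- c("VELOU",    "-",   "USD",
            "CTWO.P.V", "0.1", "CAD")

# Fill the matrix row by row, one column per header entry
result <- as.data.frame(
  matrix(cells, ncol = length(header), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(result) <- header

# Assumption: "-" marks a missing value on the page
result[result == "-"] <- NA
result$Price <- as.numeric(result$Price)

print(result)
```

Using `length(header)` keeps the reshape correct if the site adds or removes a column, as long as every row still has one cell per header.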
Upvotes: 1