Reputation: 13
I am trying to scrape the fundamentals data table (P/E ratio, P/B ratio and dividend yield) from the NSE website (link). I tried the following with the rvest package:
library(rvest)

url <- "https://www1.nseindia.com/products/content/equities/indices/historical_pepb.htm"
pgsession <- html_session(url)
But, I receive this error:
Error in curl::curl_fetch_memory(url, handle = handle) :
LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 60
Also, I tried the httr package (CSS selectors identified using the Chrome extension 'SelectorGadget'):
library(httr)

fd <- list(submit = "Get Data",   # not sure if this is the correct selector
           IndexName = "NIFTY 50",
           fromDate = "01-06-2020",
           toDate = "15-06-2020")
resp <- POST(url, body = fd, encode = "form")
But I receive the same error. I have scanned many forums for troubleshooting, but it seems the website is blocking scraping attempts. Can someone confirm this, or suggest a way to scrape the table from this website?
Upvotes: 0
Views: 1233
Reputation: 13
Here's a (crude) wrapper to fetch NIFTY 50 fundamentals data from the NSE website:
get.nse.ratios <- function(index.nse = 'NIFTY 50',
                           date.start = as.Date('2001-01-01'),
                           date.end = Sys.Date()) {
  # url.base <- 'https://www1.nseindia.com/products/content/equities/indices/historical_pepb.htm'
  index.nse <- gsub(' ', '%20', index.nse)
  # Split the date range into sub-periods the site will accept
  max.history.constraint <- 100
  dates.start <- seq.Date(date.start, date.end, by = max.history.constraint)
  data.master <- data.frame()
  # Loop over sub-periods to extract data
  for (fromDate in dates.start) {
    fromDate <- as.Date(fromDate, origin = '1970-01-01')  # the loop drops the Date class
    toDate <- min(fromDate + (max.history.constraint - 1), Sys.Date())
    cat(sprintf('Fetching data from %s to %s \n', fromDate, toDate))
    # Reformat dates as dd-mm-yyyy, as expected by the NSE endpoint
    fromDate <- format.Date(fromDate, '%d-%m-%Y')
    toDate <- format.Date(toDate, '%d-%m-%Y')
    # Build the url for the sub-period
    url.sub <- sprintf("https://www1.nseindia.com/products/dynaContent/equities/indices/historical_pepb.jsp?indexName=%s&fromDate=%s&toDate=%s&yield1=undefined&yield2=undefined&yield3=undefined&yield4=all",
                       index.nse, fromDate, toDate)
    # Scrape the first table from the sub-period url
    data.sub <- rvest::html_table(xml2::read_html(url.sub))[[1]]
    # Clean the table: row 2 holds the column names, the last row is a footer
    names.columns <- unname(unlist(data.sub[2, ]))
    data.clean <- data.sub[3:(nrow(data.sub) - 1), ]
    colnames(data.clean) <- names.columns
    data.clean$Date <- as.Date(data.clean$Date, format = '%d-%b-%Y')
    # Convert the remaining character columns to numeric
    cols.num <- names(which(sapply(data.clean, class) == 'character'))
    data.clean[cols.num] <- sapply(data.clean[cols.num], as.numeric)
    # Append to master data
    data.master <- rbind(data.master, data.clean)
  }
  return(data.master)
}
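For example, to pull the range from the question (the call below is just an illustration; adjust the index name and dates as needed):

nifty.ratios <- get.nse.ratios(index.nse = 'NIFTY 50',
                               date.start = as.Date('2020-06-01'),
                               date.end = as.Date('2020-06-15'))
head(nifty.ratios)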
Upvotes: 0
Reputation: 4658
If you right-click the page, click 'Inspect element', and go to the 'Network' tab, you can see the request being made when you click the 'Get data' button.
In this case, the request goes to the URL below, which can be read and parsed into a data frame using, for example, rvest::html_table(). By changing the URL parameters I'm positive you can extract the table you want.
url <- "https://www1.nseindia.com/products/dynaContent/equities/indices/historical_pepb.jsp?indexName=NIFTY%2050&fromDate=01-06-2020&toDate=02-06-2020&yield1=undefined&yield2=undefined&yield3=undefined&yield4=all"
rvest::html_table(xml2::read_html(url))[[1]]
gives
  Historical NIFTY 50 P/E, P/B & Div. Yield values
1          For the period 01-06-2020 to 02-06-2020
2        Date   P/E  P/B Div Yield
3 01-Jun-2020 22.96 2.80      1.55
4 02-Jun-2020 23.31 2.84      1.53
5                     Download file in csv format
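For a different index or date range, the same URL can be built programmatically. A minimal sketch, assuming the query parameters shown in the URL above (build.pepb.url is just an illustrative name; dates must be in dd-mm-yyyy format):

build.pepb.url <- function(index = "NIFTY 50", from = "01-06-2020", to = "02-06-2020") {
  # URLencode turns the space in the index name into %20
  sprintf(paste0("https://www1.nseindia.com/products/dynaContent/equities/indices/historical_pepb.jsp",
                 "?indexName=%s&fromDate=%s&toDate=%s",
                 "&yield1=undefined&yield2=undefined&yield3=undefined&yield4=all"),
          utils::URLencode(index, reserved = TRUE), from, to)
}
rvest::html_table(xml2::read_html(build.pepb.url()))[[1]]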
Upvotes: 1