LinuxNewbie

Reputation: 23

Webscraping in R where input is required

I have used the rvest package in R to scrape unique URLs before.

However, I am now stuck on a particular website. The URL stays static; I need to make selections in a series of dropdowns and then scrape the resulting table that appears.

It would be helpful if someone could point me in the right direction for websites like these. Is R even capable of doing this?

Edit: I have done some research and it seems RSelenium can handle such tasks. Unfortunately, I have no exposure to it. Can someone recommend an example/blog/material online on using Selenium specifically for clicking and scraping, for someone who is as much of a noob as I am?

Upvotes: 1

Views: 191

Answers (1)

Guillaume

Reputation: 681

I have written a blog post with an RSelenium example: https://guillaumepressiat.github.io/blog/2021/04/RSelenium-paginated-tables

This website contains a lot of material about Selenium; you will have to map it onto the RSelenium API (the verbs are almost the same in all languages: findElement, etc.): https://www.guru99.com/selenium-tutorial.html
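For instance, a findElement call written for Java or Python in those tutorials translates almost one-to-one to RSelenium. A minimal sketch, assuming a remoteDriver object remDr is already connected (as set up below) and using a placeholder id:

# Java:   driver.findElement(By.id("example_id")).click();
# Python: driver.find_element(By.ID, "example_id").click()
# RSelenium equivalent:
el <- remDr$findElement(using = "id", value = "example_id")
el$clickElement()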

But as an example based on your question, maybe something like this to begin with:

# https://stackoverflow.com/q/67021563/10527496


# start the Selenium server in a terminal first:
# java -jar selenium-server-standalone-3.9.1.jar

library(RSelenium)
library(tidyverse)
library(rvest)
library(httr)

remDr <- remoteDriver(
    remoteServerAddr = "localhost",
    port = 4444L, # change port according to terminal 
    browserName = "firefox"
)

remDr$open()
# remDr$getStatus()
url <- "https://fcainfoweb.nic.in/reports/Report_Menu_Web.aspx"
remDr$navigate(url)

Sys.sleep(5)

# first : the report-type radio buttons (only u1 is actually clicked below)
u1 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_0')
u2 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_1')
u3 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_2')
u4 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_3')

dynam <- remDr$mouseMoveToLocation(webElement = u1)
u1$click()


Sys.sleep(5)

# second : Select input
s1 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Ddl_Rpt_Option0')

# get available choices 
s_choices <- read_html(s1$getElementAttribute('innerHTML')[[1]]) %>% 
    html_nodes('option') %>% 
    html_attrs() %>% 
    unlist() %>% 
    .[3:length(.)] %>% 
    as.vector()
dynam <- remDr$mouseMoveToLocation(webElement = s1)
s1$click()
s1$sendKeysToElement(sendKeys = list(s_choices[1], key = "enter"))
# s_choices[1] is "Daily Prices"

Sys.sleep(5)

# third : the date input
s_date_choices <- remDr$findElement(using = "id", value = "ctl00_MainContent_Txt_FrmDate")
dynam <- remDr$mouseMoveToLocation(webElement = s_date_choices)
s_date_choices$click()

s_date_choices$sendKeysToElement(sendKeys = list('01/01/2021', key = "enter"))

Sys.sleep(5)

s_table <- remDr$findElement(using = "id", value = "Panel1")

# get the resulting tables as an example (dropping the first list element)
results_1 <- read_html(s_table$getElementAttribute('innerHTML')[[1]]) %>% 
    html_table(fill = TRUE) %>% 
    .[2:length(.)]


We get a list of three tables as a result.

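If you just want to check what came back before doing anything else, a quick inspection works (purely illustrative):

length(results_1)        # 3 tables in this example
lapply(results_1, dim)   # dimensions of each table
head(results_1[[1]])     # peek at the first one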

Making a function from this code that loops over a vector of dates should be possible after that, I think (you will probably have to reload a fresh start page at the base URL for each date); see the sketch below.
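A rough sketch of what that could look like (untested; it assumes the element ids used above do not change, that "Daily Prices" is the report option you want, and that dates follow the dd/mm/yyyy format used above; the date range is just an example):

scrape_prices_for_date <- function(remDr, date_chr) {
    # reload a fresh start page for each date
    remDr$navigate("https://fcainfoweb.nic.in/reports/Report_Menu_Web.aspx")
    Sys.sleep(5)

    # report type: first radio button
    u1 <- remDr$findElement(using = "id", value = "ctl00_MainContent_Rbl_Rpt_type_0")
    u1$clickElement()
    Sys.sleep(5)

    # report option: "Daily Prices"
    s1 <- remDr$findElement(using = "id", value = "ctl00_MainContent_Ddl_Rpt_Option0")
    s1$clickElement()
    s1$sendKeysToElement(list("Daily Prices", key = "enter"))
    Sys.sleep(5)

    # date input
    d1 <- remDr$findElement(using = "id", value = "ctl00_MainContent_Txt_FrmDate")
    d1$clickElement()
    d1$sendKeysToElement(list(date_chr, key = "enter"))
    Sys.sleep(5)

    # collect the tables from the results panel
    s_table <- remDr$findElement(using = "id", value = "Panel1")
    read_html(s_table$getElementAttribute("innerHTML")[[1]]) %>%
        html_table(fill = TRUE) %>%
        .[2:length(.)]
}

# example: first five days of January 2021, formatted as dd/mm/yyyy
dates <- format(seq(as.Date("2021-01-01"), as.Date("2021-01-05"), by = "day"), "%d/%m/%Y")
all_results <- lapply(dates, function(d) scrape_prices_for_date(remDr, d))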

Upvotes: 1
