Michael

Reputation: 57

Scraping tbody class object in R

I am completely new to web scraping with R, and I would like to scrape the following table, whose rows sit inside a tbody element. When I run the code below, I get only the column headers, without the data (the website is in Czech).

I should be getting the time, price, volume and volume in CZK for the orders placed there.

library(rvest)
library(dplyr)

PSE_Page <- "https://www.pse.cz/detail/CZ0003519753?tab=detail-trading-data" 
Page <- read_html(PSE_Page)

Our_table <- Page %>% 
  rvest::html_nodes('body') %>% 
  xml2::xml_find_all("//div[contains(@class, 'stock-table large-table small-text page-block-negative-margin table-container js-swipe-icon')]") %>% 
  rvest::html_text()

Our_table

Output: [1] "\n Čas\n Cena\n Celkový objem\n Celkový objem\n **

(Čas = time, Cena = price, Celkový objem = total volume.)

Can somebody help? Thanks a lot!!!


Upvotes: 0

Views: 1049

Answers (2)

Caesarius

Reputation: 109

library(tidyverse)
library(rvest)

For an XPath such as 'html/body/div[1]/table/tbody/tr', I just leave tbody out of the path. Browsers add tbody when they render a table, so it is often absent from the HTML that rvest actually downloads.

# page is assumed to be a document already parsed with read_html()
header <- html_elements(page, xpath = '/html/body/div[1]/table')
header
# {xml_nodeset (1)} <table><tr><td width="100%"> <div id="logo"> <table width="100%"><tr><td valign="top"> <a href="https://www.m "

header <- html_elements(page, xpath = '/html/body/div[1]/table/tr')
header
# {xml_nodeset (1)} <tr/><td width="100%"> <div id="logo"> <table width="100%"><tr>\n<td valign="top"> <a href="https://www.maizegdb ."
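The effect of dropping tbody can be tried offline on a small inline table. This is only a sketch: the HTML string below is a made-up stand-in for a downloaded page, not the markup of any real site, and the relative path `//table//tr` is an extra suggestion that should match rows whether or not a tbody is present.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page: the <table> has no <tbody>.
html <- '<html><body><div><table>
  <tr><td>12:00:25</td><td>95,00 %</td></tr>
  <tr><td>12:00:08</td><td>95,00 %</td></tr>
</table></div></body></html>'
doc <- read_html(html)

# Absolute path copied from a browser inspector (includes tbody)
with_tbody <- html_elements(doc, xpath = "/html/body/div[1]/table/tbody/tr")

# Same path with tbody left out
without_tbody <- html_elements(doc, xpath = "/html/body/div[1]/table/tr")

# Relative path that does not care whether tbody exists
either <- html_elements(doc, xpath = "//table//tr")

# Compare how many rows each path finds
length(with_tbody)
length(without_tbody)
length(either)

# Cell text of the first row found by the relative path
first_row <- html_text2(html_elements(either[[1]], xpath = "./td"))
first_row
```

If the first count is zero while the others are not, the tbody step in the copied XPath is the likely culprit.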

Upvotes: 0

satesrah

Reputation: 118

The table you're referring to is not a static table. It is dynamic: you can interact with it, e.g. by sorting it. rvest only sees the HTML as the server delivers it, before any JavaScript has run, so it cannot scrape this information on its own. I'm no expert in dynamic web scraping, but this code snippet extracts the data. It uses the RSelenium package to drive a web browser from within R and read the table once its dynamic content has been rendered. There are probably much better solutions out there for this job, though.

library(RSelenium)
library(dplyr)

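# Start a Selenium server and a Firefox session controlled from R
rD <- rsDriver(browser = "firefox", port = 8787L)
remDr <- rD$client
remDr$navigate("https://www.pse.cz/detail/CZ0003519753?tab=detail-trading-data")
# Parse the page source as rendered by the browser
page <- XML::htmlParse(remDr$getPageSource()[[1]])

# Close the browser and stop the Selenium server
remDr$close()
rD$server$stop()

# Text of the table header and of the table body, one string each
header <- XML::xpathSApply(page, "/html/body/div[8]/div[2]/div/div[2]/div[3]/div/div/table/thead", XML::xmlValue)
table <- XML::xpathSApply(page, "/html/body/div[8]/div[2]/div/div[2]/div[3]/div/div/table/tbody", XML::xmlValue)

# Each cell sits on its own line, so read the text line by line
header <- read.table(text = header, sep = "\n", strip.white = TRUE) %>% unlist %>% as.character()
body <- read.table(text = table, sep = "\n", strip.white = TRUE)
# Give columns 3 and 4 distinct names so they can be told apart
header[3] <- "Total Turnover pcs"
header[4] <- "Total Turnover CZK"

# The header labels are recycled over the cells, grouping them into columns
data.frame(lapply(split(body$V1, paste(header)), as.character))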

#     Price     Time Total.Turnover.CZK Total.Turnover.pcs
# 1 95,00 % 12:00:25     CZK 780,333.33        800,000 pcs
# 2 95,00 % 12:00:08     CZK 292,625.00        300,000 pcs
# 3 95,00 % 12:00:08     CZK 195,083.33        200,000 pcs
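
The columns in that result are still character strings. Below is a minimal sketch of turning them into numbers with base R. It assumes, based only on the three rows shown above, that the price uses a comma as the decimal mark while the turnover columns use a comma as the thousands separator; `to_number` is a hypothetical helper, not part of any package.

```r
# Sample rows copied from the output above; in practice this would be the
# data frame built from the scraped table.
trades <- data.frame(
  Price = c("95,00 %", "95,00 %", "95,00 %"),
  Time = c("12:00:25", "12:00:08", "12:00:08"),
  Total.Turnover.CZK = c("CZK 780,333.33", "CZK 292,625.00", "CZK 195,083.33"),
  Total.Turnover.pcs = c("800,000 pcs", "300,000 pcs", "200,000 pcs"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only digits, the minus sign and the decimal mark,
# then normalise the decimal mark to "." before converting to numeric.
to_number <- function(x, decimal_mark = ".") {
  x <- gsub(paste0("[^0-9", decimal_mark, "-]"), "", x)
  if (decimal_mark != ".") {
    x <- sub(decimal_mark, ".", x, fixed = TRUE)
  }
  as.numeric(x)
}

# Assumption: comma is the decimal mark in Price only
trades$Price <- to_number(trades$Price, decimal_mark = ",")
trades$Total.Turnover.CZK <- to_number(trades$Total.Turnover.CZK)
trades$Total.Turnover.pcs <- to_number(trades$Total.Turnover.pcs)

str(trades)
```

If the site formats numbers differently on other days or in another language version, the `decimal_mark` argument would need adjusting.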


Upvotes: 1
