Reputation: 637
I need to read parts of an Excel file into R. I have some existing code, but the authority changed the source. Previously there was a direct URL to the document; now the document can only be reached through a website landing page.
Could someone tell me which package I can use to achieve that? The landing page containing the Excel file is: http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/ There I am looking for the document "Dati operativi relativi al bilanciamento del sistema post Del. 312/2016/R/gas - Database 2018".
I've added my previous code to give an idea of what I did. As you can see, I only needed read.xlsx() for this first step.
Many thanks in advance!
library(ggplot2)
library(lubridate)
library(openxlsx)
library(reshape2)
library(dplyr)
Bilres <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G", startRow = 1, colNames = TRUE)
# Select pubblicazione and BILANCIAMENTO.RESIDUALE (column R of Storico_G) into Bilres_df
Bilres_df <- data.frame(Bilres$pubblicazione, Bilres$BILANCIAMENTO.RESIDUALE)
# Convert pubblicazione to a date-time
Bilres_df$pubblicazione <- ymd_h(Bilres_df$Bilres.pubblicazione)
# Keep only the most recent row and make sure the residual balancing value is numeric
Bilreslast <- tail(Bilres_df, 1)
Bilreslast <- data.frame(Bilreslast)
Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE <- as.numeric(as.character(Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE))
Upvotes: 0
Views: 2470
Reputation: 20302
This should get you going in the right direction.
library(data.table)
mydat <- fread('http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx')
head(mydat)
Upvotes: 0
Reputation: 10855
If you copy the URL from the web page, you can then use download.file() first to download it as a binary file, and read.xlsx() to read the data. Depending on how frequently the content changes on the web page, you may be better off just copying the URL rather than parsing it from the page.
oldFile <- "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx"
newFile <- "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
if(!dir.exists("./data")) dir.create("./data")  # make sure the target folder exists
if(!file.exists("./data/downloadedXlsx.xlsx")){
    download.file(newFile, "./data/downloadedXlsx.xlsx",
                  method = "curl", # use "curl" for OS X / Linux, "wininet" for Windows
                  mode = "wb")     # "wb" means "write binary"
} else message("file already loaded locally, using disk version")
library(openxlsx)
Bilres <- read.xlsx(xlsxFile = "./data/downloadedXlsx.xlsx",
sheet = "Storico_G", startRow = 1, colNames = TRUE)
head(Bilres[,1:3])
...and the output:
> head(Bilres[,1:3])
pubblicazione aggiornato.il IMMESSO
1 2017_01_01_06 42736.24 1915484
2 2017_01_01_07 42736.28 1915484
3 2017_01_01_08 42736.33 1866326
4 2017_01_01_09 42736.36 1866326
5 2017_01_01_10 42736.41 1866326
6 2017_01_01_11 42736.46 1866326
>
UPDATE: Added logic to avoid downloading the file once it has been downloaded.
Upvotes: 4
Reputation: 78792
You can find the .xlsx links this way:
library(rvest)
library(magrittr)
pg <- read_html("http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/")
# get all the Excel (xlsx) links on that page:
html_nodes(pg, xpath = ".//a[contains(@href, '.xlsx')]") %>%
  html_attr("href") %>%
  sprintf("http://www.snamretegas.it%s", .) -> excel_links
head(excel_links)
## [1] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
## [2] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2018/DatiOperativi_2018-IT.xlsx"
And pass the one you want to your Excel-reading function:
openxlsx::read.xlsx(excel_links[1], sheet = "Storico_G", startRow = 1, colNames = TRUE)
## data frame output here that I'm not going to show
BUT!!
This is a very selfish and unkind way to do it, since you hit that site for the Excel file every time you want to read it, wasting their CPU and bandwidth as well as your own.
You should use the download.file() technique Len described to cache a local copy and only re-download when necessary.
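For completeness, here is a minimal sketch that puts the two ideas together: scrape the current link with rvest, cache it locally with download.file(), and read the cached copy with openxlsx. The local path and the "2018" filter are assumptions for illustration; the sheet name comes from the question.
library(rvest)
library(magrittr)
library(openxlsx)

pg <- read_html("http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/")

# scrape all .xlsx links on the landing page and keep the 2018 database (assumed filter)
html_nodes(pg, xpath = ".//a[contains(@href, '.xlsx')]") %>%
  html_attr("href") %>%
  sprintf("http://www.snamretegas.it%s", .) -> excel_links
link_2018 <- excel_links[grepl("2018", excel_links)][1]

# download once and reuse the local copy (hypothetical local path)
localFile <- "./data/DatiOperativi_2018-IT.xlsx"
if(!dir.exists("./data")) dir.create("./data")
if(!file.exists(localFile)) download.file(link_2018, localFile, mode = "wb")

Bilres <- read.xlsx(xlsxFile = localFile, sheet = "Storico_G", startRow = 1, colNames = TRUE)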
Upvotes: 4