Reputation: 637
I need to read parts of an Excel file into R. I have some existing code, but the authority changed the source. Previously there was a direct URL to the document; now the document can only be reached through a website landing page.
Could someone tell me which package I can use to achieve that? The landing page containing the Excel file is: http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/ There I am looking for the document "Dati operativi relativi al bilanciamento del sistema post Del. 312/2016/R/gas - Database 2018".
I've added my previous code to give an idea of what I did. As you can see, I only needed read.xlsx() for this first step.
Many thanks in advance!
library(ggplot2)
library(lubridate)
library(openxlsx)
library(reshape2)
library(dplyr)
Bilres <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G", startRow = 1, colNames = TRUE)
# Select pubblicazione and BILANCIAMENTO.RESIDUALE (column R of Storico_G) into Bilres_df
Bilres_df <- data.frame(Bilres$pubblicazione, Bilres$BILANCIAMENTO.RESIDUALE)
# Convert pubblicazione to a date-time
Bilres_df$pubblicazione <- ymd_h(Bilres_df$Bilres.pubblicazione)
# Keep only the most recent row and make sure the residual balancing value is numeric
Bilreslast <- tail(Bilres_df, 1)
Bilreslast <- data.frame(Bilreslast)
Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE <- as.numeric(as.character(Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE))
Upvotes: 0
Views: 2470
Reputation: 20302
This should get you going in the right direction.
library(data.table)
mydat <- fread('http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx')
head(mydat)
Upvotes: 0
Reputation: 10855
If you copy the URL from the web page, you can then use download.file() first to download it as a binary file, and read.xlsx() to read the data. Depending on how frequently the content changes on the web page, you may be better off just copying the URL rather than parsing it from the page.
oldFile <- "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx"
newFile <- "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
if(!dir.exists("./data")) dir.create("./data")  # make sure the target folder exists
if(!file.exists("./data/downloadedXlsx.xlsx")){
    download.file(newFile, "./data/downloadedXlsx.xlsx",
                  method = "curl", # use "curl" for OS X / Linux, "wininet" for Windows
                  mode = "wb")     # "wb" means "write binary"
} else message("file already loaded locally, using disk version")
library(openxlsx)
Bilres <- read.xlsx(xlsxFile = "./data/downloadedXlsx.xlsx",
sheet = "Storico_G", startRow = 1, colNames = TRUE)
head(Bilres[,1:3])
...and the output:
> head(Bilres[,1:3])
pubblicazione aggiornato.il IMMESSO
1 2017_01_01_06 42736.24 1915484
2 2017_01_01_07 42736.28 1915484
3 2017_01_01_08 42736.33 1866326
4 2017_01_01_09 42736.36 1866326
5 2017_01_01_10 42736.41 1866326
6 2017_01_01_11 42736.46 1866326
>
UPDATE: Added logic to avoid downloading the file once it has been downloaded.
Upvotes: 4
Reputation: 78792
You can find the .xlsx links this way:
library(rvest)
library(magrittr)
pg <- read_html("http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/")
# get all the Excel (xlsx) links on that page:
html_nodes(pg, xpath = ".//a[contains(@href, '.xlsx')]") %>%
  html_attr("href") %>%
  sprintf("http://www.snamretegas.it%s", .) -> excel_links
head(excel_links)
## [1] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
## [2] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2018/DatiOperativi_2018-IT.xlsx"
And pass the one you want to your Excel-reading function:
openxlsx::read.xlsx(excel_links[1], sheet = "Storico_G", startRow = 1, colNames = TRUE)
## data frame output here that I'm not going to show
BUT!!
This is a very selfish and unkind way to do it, since you hit that site for the Excel file every time you want to read it, wasting their CPU and bandwidth as well as your own.
You should use the download.file() technique Len described to cache a local copy and only re-download when necessary.
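For completeness, here is a minimal sketch that puts the two ideas together: scrape the current link with rvest, cache it locally with download.file(), and read the cached copy with openxlsx. The local path and the "2018" filter are assumptions for illustration; the sheet name comes from the question.
library(rvest)
library(magrittr)
library(openxlsx)

pg <- read_html("http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/")

# scrape all .xlsx links on the landing page and keep the 2018 database (assumed filter)
html_nodes(pg, xpath = ".//a[contains(@href, '.xlsx')]") %>%
  html_attr("href") %>%
  sprintf("http://www.snamretegas.it%s", .) -> excel_links
link_2018 <- excel_links[grepl("2018", excel_links)][1]

# download once and reuse the local copy (hypothetical local path)
localFile <- "./data/DatiOperativi_2018-IT.xlsx"
if(!dir.exists("./data")) dir.create("./data")
if(!file.exists(localFile)) download.file(link_2018, localFile, mode = "wb")

Bilres <- read.xlsx(xlsxFile = localFile, sheet = "Storico_G", startRow = 1, colNames = TRUE)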
Upvotes: 4