user1162244

Reputation: 43

R code: webscraping

I am trying to webscrape an OECD table with R.

library(XML)
rawOECD <- readHTMLTable('http://stats.oecd.org/Index.aspx?DataSetCode=MEI_CLI')
OECD <- data.frame(rawOECD[[1]])

I have managed to get the basic table with the above code but I am having trouble getting it into a presentable form.

I would be grateful for your help.

Kind regards,

Adam

Upvotes: 4

Views: 459

Answers (2)

eblondel

Reputation: 603

As an alternative to scraping the HTML, which ties your code to the page's "view" (likely to change, at least depending on your data queries), consider the SDMX standard exchange format. The OECD stats portal supports it, and it also lets you parametrize your data queries from R. On the portal, click on export, select "SDMX", and copy the SDMX query web request.

Then, in R, you can easily read it with the rsdmx package:

require(rsdmx)
sdmx <- readSDMX("http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/MEI_CLI/LOLITOAA+LOLITONO+LOLITOTR_STSA+LOLITOTR_GYSA+BSCICP03+CSCICP03+LORSGPRT+LORSGPNO+LORSGPTD+LORSGPOR_IXOBSA.AUS+AUT+BEL+CAN+CHL+CZE+DNK+EST+FIN+FRA+DEU+GRC+HUN+IRL+ISR+ITA+JPN+KOR+LUX+MEX+NLD+NZL+NOR+POL+PRT+SVK+SVN+ESP+SWE+CHE+TUR+GBR+USA+EA19+G4E+G-7+NAFTA+OECDE+OECD+ONM+A5M+BRA+CHN+IND+IDN+RUS+ZAF.M/all?startTime=2013-09&endTime=2015-08")
df <- as.data.frame(sdmx)
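
To parametrize the query from R, the request URL can be assembled from its parts. This is a minimal sketch, assuming the layout seen in the request above (dataset, then dot-separated dimensions whose values are joined with `+`, then `/all` and the time range). `build_oecd_url` is a hypothetical helper for illustration, not part of rsdmx:

```r
# Hypothetical helper (not part of rsdmx): assemble an OECD SDMX request URL
# following the layout of the URL shown above.
build_oecd_url <- function(dataset, subjects, locations, frequency, start, end) {
  # Dimensions are separated by ".", values within a dimension by "+"
  key <- paste(
    paste(subjects, collapse = "+"),
    paste(locations, collapse = "+"),
    frequency,
    sep = "."
  )
  sprintf(
    "http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/%s/%s/all?startTime=%s&endTime=%s",
    dataset, key, start, end
  )
}

url <- build_oecd_url(
  dataset   = "MEI_CLI",
  subjects  = c("LOLITOAA", "BSCICP03"),
  locations = c("FRA", "DEU", "USA"),
  frequency = "M",
  start     = "2013-09",
  end       = "2015-08"
)
print(url)
```

The resulting string can be passed to readSDMX() in the same way as the hand-copied URL, assuming the service still answers at that address.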

rsdmx now also provides a way to query data from well-known data providers, and the OECD is one of them. This feature requires rsdmx version >= 0.5 (currently available only on GitHub). Here is an example:

sdmx <- readSDMX(providerId = "OECD", resource = "data", flowRef = "MEI_CLI",
                key = "all", key.mode = "SDMX",
                start = "2013-09", end = "2015-08")
df <- as.data.frame(sdmx)

Note: the same SDMX format and rsdmx can also be used to read metadata, such as the data structure definition (also provided by the OECD).

Hope this helps

Upvotes: 0

Pierre Lapointe

Reputation: 16277

How about this?

library(XML)
OECD <- readHTMLTable('http://stats.oecd.org/Index.aspx?DataSetCode=MEI_CLI',stringsAsFactors = FALSE)

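# readHTMLTable returns a list of tables; keep the one with the most rows
n.rows <- unlist(lapply(OECD, function(t) dim(t)[1]))
out <- as.data.frame(OECD[[which.max(n.rows)]])
colnames(out) <- c("Date", out[7, -ncol(out)])  # set column names from the header row (row 7)
out <- out[-(1:9), ]  # drop the non-data rows above the data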
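
The same clean-up idea can be tried without a network connection. The sketch below uses a made-up list of tables standing in for the output of readHTMLTable. The data, the header position (row 2) and the column layout are assumptions for illustration and do not match the real OECD page, where the header sits in row 7. It also converts the value columns to numeric, since stringsAsFactors = FALSE leaves them as character:

```r
# Made-up stand-in for the list returned by readHTMLTable(): one small
# layout table and one larger data table with junk rows above the header.
tables <- list(
  layout = data.frame(V1 = "menu", stringsAsFactors = FALSE),
  data = data.frame(
    V1 = c("Dataset title", "", "Jan-2015", "Feb-2015"),
    V2 = c("", "France", "100.1", "100.3"),
    V3 = c("", "Germany", "99.8", "99.9"),
    stringsAsFactors = FALSE
  )
)

# Pick the table with the most rows
n_rows <- vapply(tables, nrow, integer(1))
out <- tables[[which.max(n_rows)]]

# Promote the header row (row 2 in this toy data) to column names,
# then drop the header row and everything above it
header_row <- 2
colnames(out) <- c("Date", unlist(out[header_row, -1], use.names = FALSE))
out <- out[-seq_len(header_row), ]
rownames(out) <- NULL

# Convert the value columns from character to numeric
out[-1] <- lapply(out[-1], as.numeric)
str(out)
```

For the real page, the header row index and the number of rows to drop need to be checked by inspecting the scraped table first.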

Upvotes: 8
