XLS to CSV or R data.frame

Question

I need to (non-manually) download this file and convert the contents to a data.frame, the ability to ignore a few rows would be useful. I'm looking specifically for a solution in R or Python.

The file itself can be taken from:

http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en

Here's what I've done so far:

I've tried XLConnect (Error: IllegalArgumentException (Java): Your InputStream was neither an OLE2 stream, nor an OOXML stream)
I've tried RODBC (Error in odbcConnectExcel("xl.file") : odbcConnectExcel is only usable with 32-bit Windows)
I've tried xlrd in Python (XLRDError: Unsupported format or corrupt file)
I've tried gdata (Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method, : Intermediate file '...' missing!)

If you open the file in notepad, it's an xml file, and when opening in Excel, you get a warning message "The format and extension don't match".

Ideas that I can explore myself are also useful, should you not have an answer, please comment.

My attempt so far with XML/regex:

library(XML)
library(stringr)
download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", destfile = "horizons.xls")
doc <- readLines(con = "horizons.xls")
doc <- str_extract(doc,"]*>(.*?)")
doc <- xmlParse(doc)
listing <- xpathApply(doc, "//Row", xmlToDataFrame)
listing <- listing[4:length(listing)]
listing <- do.call(rbind,lapply(listing, t))[,6:16]
listing[,3:11] <- gsub("[^-.0-9]", "", listing[,3:11])
listing <- as.data.frame(listing, row.names = NULL,stringsAsFactors = FALSE,)
listing$V1 <- str_replace_all(listing$V1, "[^a-zA-Z0-9]", " ")
listing[5:11] <- lapply(listing[5:11],as.numeric)
names(listing) <- c(
    "Product Name",
    "Ticker",
    "Class",
    "Price",
    "Price % Change",
    "Volume",
    "NAV/unit",
    "NAV % Change",
    "% Prem/Disc",
    "Outst. Shares"
)

lukeA · Accepted Answer

Maybe a way to do it in R:

library(XML)
download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", file.path(tempdir(), "xls.xml"))
doc <- xmlParse(file.path(tempdir(), "xls.xml"))
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//ss:Row", "ss")[-(1:2)], stringsAsFactors = FALSE)
names(df) <- unlist(df[1, ], use.names = F); df <- df[-1, ] # put first row as col header and delete it
head(df)
# # # Language            ETF Type              Subtype                                 Product Name Ticker Class Closing Date   Price Price % Change Volume NAV/unit NAV % Change % Prem/Disc Outst. Shares
# 2 1 1       en INDEX AND BENCHMARK Equities — Large Cap                  Horizons S&P 500® Index ETF    HXS         2015-03-30 47.3800           2.09 314223  47.4302       1.9621       -0.11       5675671
# 3 2 2       en                                                     Horizons S&P 500® Index ETF (US$)  HXS.U         2015-03-30 37.2800          -0.19  52769  37.3539       1.2312       -0.20       5675671
# 4 3 3       en                                                        Horizons S&P/TSX 60™ Index ETF    HXT         2015-03-30 27.9600           0.98 372656  27.9144       0.9095        0.16      22019328
# 5 4 4       en                                                  Horizons S&P/TSX 60™ Index ETF (US$)  HXT.U         2015-03-30 22.0300          -0.56      0  21.9842       0.1864        0.21      22019328
# 6 5 5       en                                              Horizons S&P/TSX Capped Energy Index ETF    HXE         2015-03-30 21.4800           0.00   1200  21.5441       0.6578       -0.30        902485
# 7 6 6       en                                          Horizons S&P/TSX Capped Financials Index ETF    HXF         2015-03-30 30.0100           0.00    900  30.0804       0.1395       -0.23        500440

XLS to CSV or R data.frame

Answers (2)

Related Questions