Brandon Bertelsen
Brandon Bertelsen

Reputation: 44648

XLS to CSV or R data.frame

I need to (non-manually) download this file and convert the contents to a data.frame, the ability to ignore a few rows would be useful. I'm looking specifically for a solution in R or Python.

The file itself can be taken from:

http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en

Here's what I've done so far:

  1. I've tried XLConnect (Error: IllegalArgumentException (Java): Your InputStream was neither an OLE2 stream, nor an OOXML stream)
  2. I've tried RODBC (Error in odbcConnectExcel("xl.file") : odbcConnectExcel is only usable with 32-bit Windows)
  3. I've tried xlrd in Python (XLRDError: Unsupported format or corrupt file)
  4. I've tried gdata (Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method, : Intermediate file '...' missing!)

If you open the file in notepad, it's an xml file, and when opening in Excel, you get a warning message "The format and extension don't match".

Ideas that I can explore myself are also useful, should you not have an answer, please comment.

My attempt so far with XML/regex:

library(XML)
library(stringr)
download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", destfile = "horizons.xls")
doc <- readLines(con = "horizons.xls")
doc <- str_extract(doc,"<Table[^>]*>(.*?)</Table>")
doc <- xmlParse(doc)
listing <- xpathApply(doc, "//Row", xmlToDataFrame)
listing <- listing[4:length(listing)]
listing <- do.call(rbind,lapply(listing, t))[,6:16]
listing[,3:11] <- gsub("[^-.0-9]", "", listing[,3:11])
listing <- as.data.frame(listing, row.names = NULL,stringsAsFactors = FALSE,)
listing$V1 <- str_replace_all(listing$V1, "[^a-zA-Z0-9]", " ")
listing[5:11] <- lapply(listing[5:11],as.numeric)
names(listing) <- c(
    "Product Name",
    "Ticker",
    "Class",
    "Price",
    "Price % Change",
    "Volume",
    "NAV/unit",
    "NAV % Change",
    "% Prem/Disc",
    "Outst. Shares"
)

Upvotes: 0

Views: 225

Answers (2)

Sathish
Sathish

Reputation: 12713

This may not be the best way, but will help to give some heads up with it.

require("XML")

myfile1 <- download.file(http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en)

doc <- xmlParse(myfile1)

root_doc <- xmlRoot(doc)

identify_worksheet <- c()

for (i in 1:xmlSize(root_doc)){ 
  identify_worksheet <- c(identify_worksheet, xmlName(root_doc[[i]]) == "Worksheet") 
}

worksheet_index = which(identify_worksheet == TRUE)

name1 <- xmlSApply(root_doc[[worksheet_index]], xmlName)

row_size <- xmlSize(root_doc[[worksheet_index]][[name1]])

col_size = max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize))

row_index = which(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize) == max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize)))

df1 <- data.frame(matrix(nrow = length(row_index)-1, ncol = col_size), stringsAsFactors = FALSE)

colnames(df1) <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[1]]])

for(i in 2:length(row_index)){
  df_index = i-1
  df1[df_index,] <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[i]]])
}

View(df1)

df2 <- df1[4:ncol(df1)]

View(df2)

Identify Names from an excel sheet in xml format. I would wish to know, the following names is the standard followed for an excel sheet in xml format and also, when there are multiple worksheets, does the names of worksheets increment with number as suffix (Eg: Worksheet1, Worksheet2, and so on..).

xmlName(root_doc)
[1] "Workbook"
xmlName(root_doc[[1]])
[1] "DocumentProperties"
xmlName(root_doc[[2]])
[1] "Styles"
xmlName(root_doc[[3]])
[1] "Worksheet"

output

head(df1)
  # # Language            ETF Type              Subtype                                 Product Name Ticker Class Closing Date   Price Price % Change Volume
1 1 1       en INDEX AND BENCHMARK Equities — Large Cap                  Horizons S&P 500® Index ETF    HXS         2015-03-30 47.3800           2.09 314223
2 2 2       en                                                     Horizons S&P 500® Index ETF (US$)  HXS.U         2015-03-30 37.2800          -0.19  52769
3 3 3       en                                                        Horizons S&P/TSX 60™ Index ETF    HXT         2015-03-30 27.9600           0.98 372656
4 4 4       en                                                  Horizons S&P/TSX 60™ Index ETF (US$)  HXT.U         2015-03-30 22.0300          -0.56      0
5 5 5       en                                              Horizons S&P/TSX Capped Energy Index ETF    HXE         2015-03-30 21.4800           0.00   1200
6 6 6       en                                          Horizons S&P/TSX Capped Financials Index ETF    HXF         2015-03-30 30.0100           0.00    900
  NAV/unit NAV % Change % Prem/Disc Outst. Shares
1  47.4302       1.9621       -0.11       5675671
2  37.3539       1.2312       -0.20       5675671
3  27.9144       0.9095        0.16      22019328
4  21.9842       0.1864        0.21      22019328
5  21.5441       0.6578       -0.30        902485
6  30.0804       0.1395       -0.23        500440

head(df2)
             ETF Type              Subtype                                 Product Name Ticker Class Closing Date   Price Price % Change Volume NAV/unit
1 INDEX AND BENCHMARK Equities — Large Cap                  Horizons S&P 500® Index ETF    HXS         2015-03-30 47.3800           2.09 314223  47.4302
2                                                     Horizons S&P 500® Index ETF (US$)  HXS.U         2015-03-30 37.2800          -0.19  52769  37.3539
3                                                        Horizons S&P/TSX 60™ Index ETF    HXT         2015-03-30 27.9600           0.98 372656  27.9144
4                                                  Horizons S&P/TSX 60™ Index ETF (US$)  HXT.U         2015-03-30 22.0300          -0.56      0  21.9842
5                                              Horizons S&P/TSX Capped Energy Index ETF    HXE         2015-03-30 21.4800           0.00   1200  21.5441
6                                          Horizons S&P/TSX Capped Financials Index ETF    HXF         2015-03-30 30.0100           0.00    900  30.0804
  NAV % Change % Prem/Disc Outst. Shares
1       1.9621       -0.11       5675671
2       1.2312       -0.20       5675671
3       0.9095        0.16      22019328
4       0.1864        0.21      22019328
5       0.6578       -0.30        902485
6       0.1395       -0.23        500440

Upvotes: 1

lukeA
lukeA

Reputation: 54237

Maybe a way to do it in R:

library(XML)
download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", file.path(tempdir(), "xls.xml"))
doc <- xmlParse(file.path(tempdir(), "xls.xml"))
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//ss:Row", "ss")[-(1:2)], stringsAsFactors = FALSE)
names(df) <- unlist(df[1, ], use.names = F); df <- df[-1, ] # put first row as col header and delete it
head(df)
# # # Language            ETF Type              Subtype                                 Product Name Ticker Class Closing Date   Price Price % Change Volume NAV/unit NAV % Change % Prem/Disc Outst. Shares
# 2 1 1       en INDEX AND BENCHMARK Equities — Large Cap                  Horizons S&P 500® Index ETF    HXS         2015-03-30 47.3800           2.09 314223  47.4302       1.9621       -0.11       5675671
# 3 2 2       en                                                     Horizons S&P 500® Index ETF (US$)  HXS.U         2015-03-30 37.2800          -0.19  52769  37.3539       1.2312       -0.20       5675671
# 4 3 3       en                                                        Horizons S&P/TSX 60™ Index ETF    HXT         2015-03-30 27.9600           0.98 372656  27.9144       0.9095        0.16      22019328
# 5 4 4       en                                                  Horizons S&P/TSX 60™ Index ETF (US$)  HXT.U         2015-03-30 22.0300          -0.56      0  21.9842       0.1864        0.21      22019328
# 6 5 5       en                                              Horizons S&P/TSX Capped Energy Index ETF    HXE         2015-03-30 21.4800           0.00   1200  21.5441       0.6578       -0.30        902485
# 7 6 6       en                                          Horizons S&P/TSX Capped Financials Index ETF    HXF         2015-03-30 30.0100           0.00    900  30.0804       0.1395       -0.23        500440

Upvotes: 2

Related Questions