Reputation: 44648
I need to (non-manually) download this file and convert the contents to a data.frame, the ability to ignore a few rows would be useful. I'm looking specifically for a solution in R or Python.
The file itself can be taken from:
http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en
Here's what I've done so far:
Error: IllegalArgumentException (Java): Your InputStream was neither an OLE2 stream, nor an OOXML stream
) Error in odbcConnectExcel("xl.file") :
odbcConnectExcel is only usable with 32-bit Windows
) XLRDError: Unsupported format or corrupt file
)Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method, : Intermediate file '...' missing!
) If you open the file in notepad, it's an xml file, and when opening in Excel, you get a warning message "The format and extension don't match".
Ideas that I can explore myself are also useful, should you not have an answer, please comment.
My attempt so far with XML/regex:
library(XML)
library(stringr)
download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", destfile = "horizons.xls")
doc <- readLines(con = "horizons.xls")
doc <- str_extract(doc,"<Table[^>]*>(.*?)</Table>")
doc <- xmlParse(doc)
listing <- xpathApply(doc, "//Row", xmlToDataFrame)
listing <- listing[4:length(listing)]
listing <- do.call(rbind,lapply(listing, t))[,6:16]
listing[,3:11] <- gsub("[^-.0-9]", "", listing[,3:11])
listing <- as.data.frame(listing, row.names = NULL,stringsAsFactors = FALSE,)
listing$V1 <- str_replace_all(listing$V1, "[^a-zA-Z0-9]", " ")
listing[5:11] <- lapply(listing[5:11],as.numeric)
names(listing) <- c(
"Product Name",
"Ticker",
"Class",
"Price",
"Price % Change",
"Volume",
"NAV/unit",
"NAV % Change",
"% Prem/Disc",
"Outst. Shares"
)
Upvotes: 0
Views: 225
Reputation: 12713
This may not be the best way, but will help to give some heads up with it.
require("XML")
myfile1 <- download.file(http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en)
doc <- xmlParse(myfile1)
root_doc <- xmlRoot(doc)
identify_worksheet <- c()
for (i in 1:xmlSize(root_doc)){
identify_worksheet <- c(identify_worksheet, xmlName(root_doc[[i]]) == "Worksheet")
}
worksheet_index = which(identify_worksheet == TRUE)
name1 <- xmlSApply(root_doc[[worksheet_index]], xmlName)
row_size <- xmlSize(root_doc[[worksheet_index]][[name1]])
col_size = max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize))
row_index = which(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize) == max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize)))
df1 <- data.frame(matrix(nrow = length(row_index)-1, ncol = col_size), stringsAsFactors = FALSE)
colnames(df1) <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[1]]])
for(i in 2:length(row_index)){
df_index = i-1
df1[df_index,] <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[i]]])
}
View(df1)
df2 <- df1[4:ncol(df1)]
View(df2)
Identify Names from an excel sheet in xml format. I would wish to know, the following names is the standard followed for an excel sheet in xml format and also, when there are multiple worksheets, does the names of worksheets increment with number as suffix (Eg: Worksheet1, Worksheet2, and so on..).
xmlName(root_doc)
[1] "Workbook"
xmlName(root_doc[[1]])
[1] "DocumentProperties"
xmlName(root_doc[[2]])
[1] "Styles"
xmlName(root_doc[[3]])
[1] "Worksheet"
output
head(df1)
# # Language ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume
1 1 1 en INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223
2 2 2 en Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769
3 3 3 en Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656
4 4 4 en Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0
5 5 5 en Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200
6 6 6 en Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900
NAV/unit NAV % Change % Prem/Disc Outst. Shares
1 47.4302 1.9621 -0.11 5675671
2 37.3539 1.2312 -0.20 5675671
3 27.9144 0.9095 0.16 22019328
4 21.9842 0.1864 0.21 22019328
5 21.5441 0.6578 -0.30 902485
6 30.0804 0.1395 -0.23 500440
head(df2)
ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume NAV/unit
1 INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223 47.4302
2 Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769 37.3539
3 Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656 27.9144
4 Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0 21.9842
5 Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200 21.5441
6 Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900 30.0804
NAV % Change % Prem/Disc Outst. Shares
1 1.9621 -0.11 5675671
2 1.2312 -0.20 5675671
3 0.9095 0.16 22019328
4 0.1864 0.21 22019328
5 0.6578 -0.30 902485
6 0.1395 -0.23 500440
Upvotes: 1
Reputation: 54237
Maybe a way to do it in R:
library(XML)
download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", file.path(tempdir(), "xls.xml"))
doc <- xmlParse(file.path(tempdir(), "xls.xml"))
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//ss:Row", "ss")[-(1:2)], stringsAsFactors = FALSE)
names(df) <- unlist(df[1, ], use.names = F); df <- df[-1, ] # put first row as col header and delete it
head(df)
# # # Language ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume NAV/unit NAV % Change % Prem/Disc Outst. Shares
# 2 1 1 en INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223 47.4302 1.9621 -0.11 5675671
# 3 2 2 en Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769 37.3539 1.2312 -0.20 5675671
# 4 3 3 en Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656 27.9144 0.9095 0.16 22019328
# 5 4 4 en Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0 21.9842 0.1864 0.21 22019328
# 6 5 5 en Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200 21.5441 0.6578 -0.30 902485
# 7 6 6 en Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900 30.0804 0.1395 -0.23 500440
Upvotes: 2