Reputation: 7514
I have a website that requires a login to access its data.
import pandas as pd
import requests
r = requests.get(my_url, cookies=my_cookies) # my_cookies are imported from a selenium session.
df = pd.io.excel.read_excel(r.content, sheetname=0)
Response:
IOError: [Errno 2] No such file or directory: 'Ticker\tAction\tName\tShares\tPrice\...
Apparently, the string is treated as a filename. Is there a way to have it processed as file contents instead? Alternatively, can we pass cookies to pd.read_html?
EDIT: After further processing we can now see that this is actually a csv file. The content of the downloaded file is:
In [201]: r.content
Out [201]: 'Ticker\tAction\tName\tShares\tPrice\tCommission\tAmount\tTarget Weight\nBRSS\tSELL\tGlobal Brass and Copper Holdings Inc\t400.0\t17.85\t-1.00\t7,140\t0.00\nCOHU\tSELL\tCohu Inc\t700.0\t12.79\t-1.00\t8,953\t0.00\nUNTD\tBUY\tUnited Online Inc\t560.0\t15.15\t-1.00\t-8,484\t0.00\nFLXS\tBUY\tFlexsteel Industries Inc\t210.0\t40.31\t-1.00\t-8,465\t0.00\nUPRO\tCOVER\tProShares UltraPro S&P500\t17.0\t71.02\t-0.00\t-1,207\t0.00\n'
Notice that it is tab delimited. Still, trying:
# csv version 1
df = pd.read_csv(r.content)
# Returns error, file does not exist. Apparently read_csv() is also trying to read it as a file.
# csv version 2
fh = io.BytesIO(r.content)
df = pd.read_csv(fh) # ValueError: No columns to parse from file.
# csv version 3
s = StringIO(r.content)
df = pd.read_csv(s)
# No error, but the resulting df is not parsed properly; \t's show up in the text of the dataframe.
Upvotes: 3
Views: 4767
Reputation: 784
Simply wrap the file contents in a BytesIO:
import io

with io.BytesIO(r.content) as fh:
    df = pd.io.excel.read_excel(fh, sheetname=0)
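Since the question's edit shows that the payload is really tab-delimited text rather than an Excel workbook, the same wrapping idea can probably be applied with read_csv instead. The sketch below is only an illustration: it uses a shortened copy of the content from the question in place of r.content, and it assumes the commas in the Amount column are thousands separators. read_csv splits on commas by default, which is why the tabs ended up inside the cells in the question's third attempt.

```python
import io

import pandas as pd

# Stand-in for r.content: first two data rows copied from the question.
content = (
    b"Ticker\tAction\tName\tShares\tPrice\tCommission\tAmount\tTarget Weight\n"
    b"BRSS\tSELL\tGlobal Brass and Copper Holdings Inc\t400.0\t17.85\t-1.00\t7,140\t0.00\n"
    b"COHU\tSELL\tCohu Inc\t700.0\t12.79\t-1.00\t8,953\t0.00\n"
)

# Wrap the bytes in a file-like object and tell pandas the delimiter is a tab.
# thousands="," is an assumption about how the Amount column is formatted.
df = pd.read_csv(io.BytesIO(content), sep="\t", thousands=",")
print(df)
```

With a real response, io.BytesIO(r.content) would take the place of the hard-coded bytes.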
Upvotes: 4
Reputation: 50540
This functionality was included in an update from 2014. According to the documentation it is as simple as providing the url:
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx
Based on the code you've provided, it looks like you are using pandas 0.13.x. If you can upgrade to a newer version (the code below was tested with 0.16.x), you can get this to work without the requests library. URL support was added in 0.14.1.
data2 = pd.read_excel(data_url)
As an example of a full script (with the example XLS document taken from the original bug report stating that read_excel didn't accept a URL):
import pandas as pd
data_url = "http://www.eia.gov/dnav/pet/xls/PET_PRI_ALLMG_A_EPM0_PTC_DPGAL_M.xls"
data = pd.read_excel(data_url, "Data 1", skiprows=2)
Upvotes: 1