Reputation: 81
I'm using requests to authenticate into a website and download an xlsx report. The website does not offer any other format.
import requests

s = requests.Session()
r = s.request(
    method='POST',
    url=...,  # website
    data={
        'email': ...,     # my email
        'password': ...,  # my password
    },
)
# do some stuff with s until I get the report URL
file = s.get(file_url)
file is a Response object holding the file data:
>>> file.content
Out: b'PK\x03\x04\x1/... more bytes data'
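Those leading bytes are the zip magic number; an xlsx file is just a zip archive. A quick sanity check on the download, as a minimal sketch using only the standard library:
import io
import zipfile

file.raise_for_status()  # make sure the download itself succeeded
# xlsx is a zip archive, hence the b'PK\x03\x04' prefix above
assert zipfile.is_zipfile(io.BytesIO(file.content))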
If I had a file on disk, I could do
df = (sparkSession.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheet1'!")
    .option("header", "true")
    .load(...))  # path to file on disk
But I don't have permission to write files on the file system, so I can't write the file and then load it. It would also be an unnecessary write-read round trip.
If I have the data in the variable, surely I can do something like
df = (sparkSession.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheet1'!")
    .option("header", "true")
    .load_from_bytes(file.content))  # this method doesn't exist; it's what I want
Is there a way to do it? How do I do it?
Upvotes: 1
Views: 2107
Reputation: 81
Update.
I haven't yet found the perfect workaround for this problem. I believe Alex Ott's answer will work for most people. Our development cluster, however, has access control that prevents me from writing to or reading from DBFS. I could still run his solution as a job on a separate cluster without access control, but I can't test it during development, so I chose not to use it.
What I'm doing is building a pandas DataFrame and converting it to a Spark DataFrame.
%pip install xlrd

import io
import pandas as pd
from pyspark.sql import types

# parse the xlsx bytes with pandas (BytesIO gives read_excel a file-like object)
pandas_df = pd.read_excel(io.BytesIO(file.content))

file_struct = types.StructType([
    # ... StructFields and all that good stuff
])

spark_df = spark.createDataFrame(pandas_df, file_struct)
# ... do stuff with spark_df
It is not very elegant, but it works both as a job and during testing, so I decided to go with it and move on with my life.
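One caveat: xlrd 2.0 dropped xlsx support, so on a newer cluster pandas will refuse to use it for this file. A sketch of the same step with openpyxl instead (the engine parameter is pandas' documented way to pick the reader):
%pip install openpyxl

import io
import pandas as pd

# newer pandas/xlrd combinations need openpyxl for .xlsx files
pandas_df = pd.read_excel(io.BytesIO(file.content), engine="openpyxl")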
Upvotes: 0
Reputation: 87279
Just store the data in a file on DBFS and read it from there:
with open("/dbfs/tmp/my_file.xslx", "wb") as f:
f.write(file.content)
sparkSession.read.format("com.crealytics.spark.excel")
.option("dataAddress", "'Sheet1'!")
.option("header", "true")
.load("/tmp/my_file.xslx")
.... work with data from file
# remove file
dbutils.fs.rm("/tmp/my_file.xslx")
The write to DBFS is needed because otherwise the other nodes won't be able to access a file that exists only in memory on the driver node.
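If several runs could execute this at the same time, a unique temporary name avoids them overwriting each other; a small sketch (the path is just an example):
import uuid

# hypothetical per-run path so concurrent jobs don't clobber each other
tmp_path = f"/tmp/report_{uuid.uuid4().hex}.xlsx"

with open(f"/dbfs{tmp_path}", "wb") as f:  # /dbfs is the DBFS FUSE mount
    f.write(file.content)

# ... read with Spark from tmp_path, then:
dbutils.fs.rm(tmp_path)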
P.S. If you're doing this on Community Edition, then you need to write to local disk first and then copy the file to DBFS:
with open("/tmp/my_file.xslx", "wb") as f:
f.write(file.content)
dbutils.fs.cp("file:///tmp/my_file.xslx", "dbfs:/tmp/my_file.xslx")
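When done, both copies can be cleaned up (paths as above; the local remove uses the standard library):
import os

# remove the DBFS copy and the local staging copy
dbutils.fs.rm("dbfs:/tmp/my_file.xlsx")
os.remove("/tmp/my_file.xlsx")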
Upvotes: 1