Reputation: 81
I'm using requests to authenticate into a website and download an xlsx report. The website does not offer any other format.
import requests

s = requests.Session()
r = s.request(
    method='POST',
    url=...,  # website
    data={
        'email': ...,     # my email
        'password': ...,  # my password
    },
)
# do some stuff with s until I get the report URL
file = s.get(file_url)
file is a Response object holding the file data:
>>> file.content
Out: b'PK\x03\x04\x1/... more bytes data'
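Those leading bytes are the zip magic number; an xlsx file is just a zip archive. A quick sanity check on the download, as a minimal sketch using only the standard library:
import io
import zipfile

file.raise_for_status()  # make sure the download itself succeeded
# xlsx is a zip archive, hence the b'PK\x03\x04' prefix above
assert zipfile.is_zipfile(io.BytesIO(file.content))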
If I had a file on disk, I could do
df = (sparkSession.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheet1'!")
    .option("header", "true")
    .load(...))  # path to file on disk
But I don't have permission to write files on the file system, so I can't write the file and then load it. It would also be an unnecessary write-read round trip.
If I have the data in the variable, surely I can do something like
df = (sparkSession.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheet1'!")
    .option("header", "true")
    .load_from_bytes(file.content))  # this method doesn't exist; it's what I want
Is there a way to do it? How do I do it?
Upvotes: 1
Views: 2107
Reputation: 81
Update.
I haven't yet found the perfect workaround for this problem. I believe Alex Ott's answer will work for most people. Our development cluster, however, has access control that prevents me from writing to or reading from DBFS. I could still run his solution as a job on a separate cluster without access control, but I can't test it during development, so I chose not to use it.
What I'm doing is building a pandas DataFrame and converting it to a Spark DataFrame.
%pip install xlrd

import io
import pandas as pd
from pyspark.sql import types

# parse the xlsx bytes with pandas (BytesIO gives read_excel a file-like object)
pandas_df = pd.read_excel(io.BytesIO(file.content))

file_struct = types.StructType([
    # ... StructFields and all that good stuff
])

spark_df = spark.createDataFrame(pandas_df, file_struct)
# ... do stuff with spark_df
It is not very elegant, but it works both as a job and during testing, so I decided to go with it and move on with my life.
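One caveat: xlrd 2.0 dropped xlsx support, so on a newer cluster pandas will refuse to use it for this file. A sketch of the same step with openpyxl instead (the engine parameter is pandas' documented way to pick the reader):
%pip install openpyxl

import io
import pandas as pd

# newer pandas/xlrd combinations need openpyxl for .xlsx files
pandas_df = pd.read_excel(io.BytesIO(file.content), engine="openpyxl")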
Upvotes: 0
Reputation: 87279
Just store the data in a file on DBFS and read it from there:
with open("/dbfs/tmp/my_file.xslx", "wb") as f:
f.write(file.content)
sparkSession.read.format("com.crealytics.spark.excel")
.option("dataAddress", "'Sheet1'!")
.option("header", "true")
.load("/tmp/my_file.xslx")
.... work with data from file
# remove file
dbutils.fs.rm("/tmp/my_file.xslx")
The write to DBFS is needed because otherwise the other nodes won't be able to access a file that exists only in memory on the driver node.
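If several runs could execute this at the same time, a unique temporary name avoids them overwriting each other; a small sketch (the path is just an example):
import uuid

# hypothetical per-run path so concurrent jobs don't clobber each other
tmp_path = f"/tmp/report_{uuid.uuid4().hex}.xlsx"

with open(f"/dbfs{tmp_path}", "wb") as f:  # /dbfs is the DBFS FUSE mount
    f.write(file.content)

# ... read with Spark from tmp_path, then:
dbutils.fs.rm(tmp_path)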
P.S. If you're doing this on Community Edition, then you need to write to local disk first and then copy the file to DBFS:
with open("/tmp/my_file.xslx", "wb") as f:
f.write(file.content)
dbutils.fs.cp("file:///tmp/my_file.xslx", "dbfs:/tmp/my_file.xslx")
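When done, both copies can be cleaned up (paths as above; the local remove uses the standard library):
import os

# remove the DBFS copy and the local staging copy
dbutils.fs.rm("dbfs:/tmp/my_file.xlsx")
os.remove("/tmp/my_file.xlsx")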
Upvotes: 1