maclura

Reputation: 3

Copy sharepoint binary files to OneLake with Pyspark

I am trying to develop a general-purpose pipeline that ingests into a OneLake Fabric folder all the files contained in a SharePoint Online folder, with no transformation: a 1-to-1 copy of those files (in my case, .xlsx files).

Since Fabric Web/HTTP connections cannot be parameterized (yet), I am trying to do this in a notebook written in PySpark, so I can pass all the customized parameters from the pipeline to the notebook and get a truly parametric execution. Parameters like, e.g.:

[image: example of parameter list]

What I've done up to now:

  1. Get an access token for the SharePoint tenant for my app registered in Entra ID

  2. Connect to SharePoint Online with the SharePoint REST API and get the SharePoint folder contents (using the requests module)

  3. Download each file from that folder (using the requests module) and try to write it to OneLake in binary format.
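For context, steps 1 and 2 above could be sketched roughly like this. The client-credentials flow and the shape of the SharePoint REST endpoint are assumptions about the poster's setup; all tenant, site, and folder names are placeholders:

```python
import requests


def sharepoint_folder_files_url(site_url: str, folder_path: str) -> str:
    """Build the SharePoint REST endpoint that lists the files in a
    server-relative folder (assumed endpoint shape)."""
    return (f"{site_url}/_api/web/GetFolderByServerRelativeUrl"
            f"('{folder_path}')/Files")


def get_access_token(tenant_id: str, client_id: str,
                     client_secret: str, resource: str) -> str:
    """Acquire a token via the client-credentials flow against Entra ID
    (assumes an app registration with application permissions)."""
    token_url = (f"https://login.microsoftonline.com/{tenant_id}"
                 f"/oauth2/v2.0/token")
    data = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{resource}/.default",
    }
    resp = requests.post(token_url, data=data)
    resp.raise_for_status()
    return resp.json()["access_token"]
```

The listing call would then be a `requests.get` against the built URL with an `Authorization: Bearer <token>` header.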

Now I am stuck on writing to OneLake with Python file I/O.

When I get my binary file from the REST API call, I tried this code:

response = requests.get(url, headers=headers)
bytes_stream = io.BytesIO(response.content)
with open(file_path, "wb") as file:
    file.write(bytes_stream.read())
bytes_stream.close()

but I always get the same error:

FileNotFoundError: [Errno 2] No such file or directory:

The path and filename are correct, with both absolute and relative paths, and writing and reading text files works correctly when using notebookutils.

Any help is greatly appreciated.

Upvotes: 0

Views: 144

Answers (1)

David Browne - Microsoft

Reputation: 89361

The notebook's default lakehouse is mounted into the filesystem at /lakehouse/default/, so

file_path = '/lakehouse/default/Files/SomeFolder/SomeFile.xlsx'

and Bob's your uncle.

Upvotes: 1
