Harry Leboeuf

Reputation: 745

Python Pandas read csv from DataLake

I'm trying to read a CSV file stored on Azure Data Lake Storage Gen2; the Python code runs in Databricks. Here are two lines of code: the first one works, the second one fails. Do I really have to mount the ADLS for pandas to be able to access it?

data1 = spark.read.option("header",False).format("csv").load("abfss://[email protected]/belgium/dessel/c3/kiln/temp/Auto202012101237.TXT")
data2 = pd.read_csv("abfss://[email protected]/belgium/dessel/c3/kiln/temp/Auto202012101237.TXT")

Any suggestions?

Upvotes: 1

Views: 7929

Answers (2)

Alex Ott

Reputation: 87359

Out of the box, pandas doesn't know about abfss:// paths and works most reliably with local files. On Databricks you can copy the file to the driver's local disk and then open it with pandas. This can be done either with %fs cp abfss://.... file:/your-location or with dbutils.fs.cp("abfss://....", "file:/your-location") (see docs).
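As a rough illustration of the copy-then-read approach, here is a minimal sketch. The helper names local_file_uri and read_csv_via_local_copy are made up for this example, and the copy function is passed in as a parameter on the assumption that you hand it dbutils.fs.cp when running inside a Databricks notebook.

```python
import os
import posixpath
import tempfile

import pandas as pd


def local_file_uri(local_path):
    """Return a file:/ URI for a path on the driver's local disk."""
    return "file:" + os.path.abspath(local_path)


def read_csv_via_local_copy(remote_path, copy_fn, **read_csv_kwargs):
    """Copy remote_path to a temporary local file, then read it with pandas.

    copy_fn is any callable taking (source, destination); on Databricks
    this is assumed to be dbutils.fs.cp. Both helper names are
    hypothetical and exist only for this sketch.
    """
    # Keep the original file name, but place it in a fresh temp directory.
    local_path = os.path.join(tempfile.mkdtemp(), posixpath.basename(remote_path))
    copy_fn(remote_path, local_file_uri(local_path))
    return pd.read_csv(local_path, **read_csv_kwargs)


# In a Databricks notebook (dbutils is only defined there):
# data2 = read_csv_via_local_copy("abfss://....", dbutils.fs.cp, header=None)
```

Passing header=None mirrors the header=False option in the Spark call from the question. The temporary copy lives on the driver node only, so this suits files that fit comfortably on one machine.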

Another possibility is to use the Koalas library instead of pandas; it provides a pandas-compatible API on top of Spark. Besides being able to access data in the cloud, you can also run your code in a distributed fashion.

Upvotes: 2

Harry Leboeuf

Reputation: 745

I solved it by mounting the cloud storage as a drive. It works fine now.
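For anyone wondering how pandas sees a mounted location: on Databricks, DBFS paths are exposed on the driver's local filesystem under /dbfs, so a mount can be read like an ordinary file. A minimal sketch of the path translation follows; the helper name dbfs_to_local_path and the mount point /mnt/datalake are assumptions for illustration, not something from my actual setup.

```python
def dbfs_to_local_path(path):
    """Translate a DBFS path into the local /dbfs path that pandas can open.

    Accepts either 'dbfs:/mnt/...' or '/mnt/...'. The function name is
    hypothetical and used only for this sketch.
    """
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    if not path.startswith("/"):
        path = "/" + path
    return "/dbfs" + path


# In a Databricks notebook, with the storage mounted at the
# hypothetical mount point /mnt/datalake:
# import pandas as pd
# data2 = pd.read_csv(dbfs_to_local_path("dbfs:/mnt/datalake/...."), header=None)
```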

Upvotes: 0
