sparc

Reputation: 429

How to get list of parquet file names from a directory in Azure datalake in pyspark?

I want to get a list of all parquet file names (the long file names starting with 'part-') from a directory in Azure Data Lake in PySpark.

How can I achieve this?

Upvotes: 0

Views: 2789

Answers (1)

Rakesh Govindula

Reputation: 11529

I reproduced this and got the below results.

These are my parquet files in the ADLS container.

(screenshot: parquet files in the ADLS container)

To get these files in Synapse, first mount the ADLS container to Synapse using an ADLS linked service.
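A minimal sketch of that mount step (the mount point and linked service name here are placeholders, not from my setup above):

# mount the ADLS container into Synapse via a linked service
mssparkutils.fs.mount(
    "abfss://<container_name>@<storageaccount_name>.dfs.core.windows.net",
    "/parquet_mount",
    {"linkedService": "<your_ADLS_linked_service_name>"}
)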

After mounting, use the below code to get the parquet files that start with 'part'.

# mssparkutils is available by default in Synapse notebooks
files_list = mssparkutils.fs.ls("abfss://<container_name>@<storageaccount_name>.dfs.core.windows.net/")
print("Total files list : ", files_list)

# keep only the paths of files whose names start with 'part'
flist = [f.path for f in files_list if f.name.startswith('part')]
print("\nFile paths that start with part:", flist)

My execution, for your reference:

(screenshot: execution output of the listing code)

If you want to read all of these files, you can just use the wildcard path part* in the file path, like this.

df = spark.read.parquet("abfss://<container_name>@<storageaccount_name>.dfs.core.windows.net/part*.parquet")
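Alternatively, instead of a wildcard you can pass the paths collected in flist straight to the reader (a sketch using the standard Spark reader API, not shown in my screenshots):

# read exactly the files gathered above
df = spark.read.parquet(*flist)
df.show()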

(screenshot: dataframe read with the wildcard path)

Upvotes: 0
