Reputation: 429
I want to get a list of all parquet file names from a directory in Azure Data Lake in PySpark, i.e. the long file names starting with 'part-'.
How can I achieve this?
Upvotes: 0
Views: 2789
Reputation: 11529
I reproduced this with some parquet files in my ADLS container and got the results below.
To get these files in Synapse, first mount the ADLS container to Synapse using an ADLS linked service.
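For example, the mount call looks roughly like this (the linked service name and mount point below are placeholders, not values from my setup):

# Sketch of mounting ADLS in Synapse; "myADLSLinkedService" and "/mnt/data"
# are placeholder names for your own linked service and mount point.
mssparkutils.fs.mount(
    "abfss://<container_name>@<storageaccount_name>.dfs.core.windows.net",
    "/mnt/data",
    {"linkedService": "myADLSLinkedService"}
)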
After mounting, use the code below to get the parquet files whose names start with 'part'.
# mssparkutils is available by default in Synapse notebooks
files_list = mssparkutils.fs.ls("abfss://<container_name>@<storageaccount_name>.dfs.core.windows.net/")
print("Total files list:", files_list)

# Collect the paths of files whose names start with 'part'
flist = []
for f in files_list:
    if f.name.startswith('part'):
        flist.append(f.path)

print("\nFile paths that start with 'part':", flist)
If you want to read all of the files at once, you can just use the wildcard path part* in the file path, like this:

df = spark.read.parquet("abfss://<container_name>@<storageaccount_name>.dfs.core.windows.net/part*.parquet")
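If you also need the file names after a wildcard read, one option (a sketch, using the same placeholder container and account) is input_file_name from pyspark.sql.functions:

from pyspark.sql.functions import input_file_name

# Each row keeps the full path of the parquet file it was read from
df = spark.read.parquet("abfss://<container_name>@<storageaccount_name>.dfs.core.windows.net/part*.parquet")
df.withColumn("source_file", input_file_name()) \
  .select("source_file").distinct().show(truncate=False)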
Upvotes: 0