PySpark - get all the contents of a container folder as a list in an Azure Synapse workspace and store that data

Question

In a Synapse workspace I'm using this function to list all the files contained in the config container:

mssparkutils.fs.ls("abfss://config@datalake.dfs.core.windows.net/")

and I got this list:

[FileInfo(path=abfss://config@datalake.dfs.core.windows.net/config.json, name=config.json, size=26771),
 FileInfo(path=abfss://config@datalake.dfs.core.windows.net/d365crm/Account.xml, name=Account.xml, size=3041),
 FileInfo(path=abfss://config@datalake.dfs.core.windows.net/d365crm/Contact.xml, name=Contact.xml, size=1985),
 FileInfo(path=abfss://config@datalake.dfs.core.windows.net/d365crm/Contract.xml, name=Contract.xml, size=1987)]

I want to store this data in a PySpark DataFrame.

I tried the following code, but it returns all null values:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

list_df = [item for sublist in list_for_dataframe for item in sublist]

schema = StructType([
    StructField("path", StringType(), True),
    StructField("name", StringType(), True),
    StructField("size", LongType(), True)
])

df = spark.createDataFrame(list_df, schema=schema)

# Show the DataFrame
df.show()

My desired output is a DataFrame with one row per file and the columns path, name, and size.

Can anyone please help me in achieving this?

Thank you!

Rakesh Govindula · Accepted Answer

If you can already get the recursive file list shown above, convert it into a list of dictionaries and then create the DataFrame from that list.

As a sample I used a wasbs path; files_list in the code below is the list returned by mssparkutils.fs.ls().

Convert it into a list of dictionaries as shown below, and the DataFrame is created with the expected values.

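from pyspark.sql.types import StructType, StructField, StringType, LongType

# files_list is the list of FileInfo objects returned by mssparkutils.fs.ls()
# Copy the attributes of each FileInfo into a plain dictionary
data = []
for i in files_list:
    d = {}
    d["path"] = i.path
    d["name"] = i.name
    d["size"] = i.size
    data.append(d)
print("List of dictionaries : ", data)

# Explicit schema; the dictionary keys must match these field names
schema = StructType([
    StructField("path", StringType(), True),
    StructField("name", StringType(), True),
    StructField("size", LongType(), True)
])

df = spark.createDataFrame(data, schema=schema)
print("Dataframe is: ")
display(df)  # display() is available in Synapse notebooks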
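The same conversion can also be written more compactly with a list of tuples. This is only a minimal sketch: the `FileInfo` namedtuple below is a stand-in I made up so the snippet runs outside Synapse, and `to_rows` is a hypothetical helper name. In a notebook you would pass the real list from mssparkutils.fs.ls() instead.

```python
from collections import namedtuple

# Stand-in for the objects returned by mssparkutils.fs.ls() (assumption:
# only the path, name and size attributes are needed here).
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])

files_list = [
    FileInfo("abfss://config@datalake.dfs.core.windows.net/config.json", "config.json", 26771),
    FileInfo("abfss://config@datalake.dfs.core.windows.net/d365crm/Account.xml", "Account.xml", 3041),
]


def to_rows(files):
    """Pull the three attributes out of each entry as a plain tuple."""
    return [(f.path, f.name, int(f.size)) for f in files]


rows = to_rows(files_list)
print(rows)

# In a Synapse notebook, where `spark` is already defined, the tuples can be
# passed together with a DDL schema string:
# df = spark.createDataFrame(rows, schema="path string, name string, size bigint")
# df.show(truncate=False)
```

Tuples are matched to the schema by position, so the order in `to_rows` has to follow the order of the schema fields.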

You can go through this blog by @Raki Rahman to learn about listing files recursively and creating a DataFrame from the result using pandas.
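If you do not have the recursive list yet, it could be built roughly like this. This is a sketch, not the code from the blog: `deep_ls` is a hypothetical helper, it assumes each entry exposes an `isDir` attribute next to `path`, `name` and `size` (as in the mssparkutils documentation example), and `fake_ls` with its in-memory tree exists only so the snippet runs outside Synapse.

```python
from collections import namedtuple


def deep_ls(ls, root):
    """Yield every file under `root`, descending into directories.

    `ls` is any callable that behaves like mssparkutils.fs.ls: it takes a
    path and returns entries with .path, .name, .size and .isDir.
    """
    for entry in ls(root):
        if entry.isDir:
            yield from deep_ls(ls, entry.path)
        else:
            yield entry


# Hypothetical in-memory stand-in for a container, for illustration only.
Entry = namedtuple("Entry", ["path", "name", "size", "isDir"])
fake_tree = {
    "root/": [
        Entry("root/config.json", "config.json", 26771, False),
        Entry("root/d365crm/", "d365crm", 0, True),
    ],
    "root/d365crm/": [
        Entry("root/d365crm/Account.xml", "Account.xml", 3041, False),
        Entry("root/d365crm/Contact.xml", "Contact.xml", 1985, False),
    ],
}


def fake_ls(path):
    return fake_tree[path]


all_files = list(deep_ls(fake_ls, "root/"))
print([(f.path, f.name, f.size) for f in all_files])

# In a Synapse notebook you would pass the real function instead:
# all_files = list(deep_ls(mssparkutils.fs.ls, "abfss://config@datalake.dfs.core.windows.net/"))
```

The resulting list contains only files, so it can be converted to dictionaries and passed to spark.createDataFrame exactly as in the code above.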
