coding

Reputation: 167

PySpark - get all the contents of a container folder as a list in Azure Synapse workspace and store that data in a DataFrame

In Synapse Workspace I'm using this function to get all the files contained in the config container:

mssparkutils.fs.ls("abfss://[email protected]/")

and I got this list:

[FileInfo(path=abfss://[email protected]/config.json, name=config.json, size=26771),
 FileInfo(path=abfss://[email protected]/d365crm/Account.xml, name=Account.xml, size=3041),
 FileInfo(path=abfss://[email protected]/d365crm/Contact.xml, name=Contact.xml, size=1985),
 FileInfo(path=abfss://[email protected]/d365crm/Contract.xml, name=Contract.xml, size=1987)]

and I want to store this data in a pyspark dataframe.

I tried this code, but it returns all null values:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

list_df = [item for sublist in list_for_dataframe for item in sublist]

schema = StructType([
    StructField("path", StringType(), True),
    StructField("name", StringType(), True),
    StructField("size", LongType(), True)
])

df = spark.createDataFrame(list_df, schema=schema)

# Show the DataFrame
df.show() 

and my desired output would be:

(screenshot of the desired DataFrame, with path, name, and size columns)

Can anyone please help me in achieving this?

Thank you!

Upvotes: 0

Views: 2788

Answers (1)

Rakesh Govindula

Reputation: 11399

If you are able to get the recursive file list as above, convert it into a list of dictionaries and create the DataFrame from that list.

As a sample I used a wasbs path; this is my list from mssparkutils.fs.ls():

(screenshot of the FileInfo list returned by mssparkutils.fs.ls())

Convert it into a list of dictionaries as below, and you can see the DataFrame is created:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# files_list is the result of mssparkutils.fs.ls() shown above
data = []
for i in files_list:
    d = {}
    d["path"] = i.path
    d["name"] = i.name
    d["size"] = i.size
    data.append(d)
print("List of dictionaries: ", data)

schema = StructType([
    StructField("path", StringType(), True),
    StructField("name", StringType(), True),
    StructField("size", LongType(), True)
])

df = spark.createDataFrame(data, schema=schema)
print("Dataframe is: ")
display(df)
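The per-field loop above can also be written as a list comprehension over tuples, since createDataFrame accepts tuples in schema column order. The sketch below mimics the FileInfo entries with a namedtuple (a stand-in, because mssparkutils is only available inside Synapse, and the account name here is a placeholder); inside Synapse, files_list would come from mssparkutils.fs.ls().

```python
from collections import namedtuple

# Stand-in for mssparkutils.fs.ls() output so the conversion step
# can be seen on its own outside Synapse
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])
files_list = [
    FileInfo("abfss://config@account.dfs.core.windows.net/config.json",
             "config.json", 26771),
    FileInfo("abfss://config@account.dfs.core.windows.net/d365crm/Account.xml",
             "Account.xml", 3041),
]

# One tuple per file, in the same order as the schema columns (path, name, size)
rows = [(f.path, f.name, f.size) for f in files_list]
print(rows)

# Inside Synapse you would then create the DataFrame with the same schema:
# df = spark.createDataFrame(rows, schema=schema)
```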

(screenshot of the resulting DataFrame)

You can go through this blog by @Raki Rahman to learn about listing files recursively and creating a DataFrame from the result using pandas.
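Since mssparkutils.fs.ls() lists only one level, a recursive walk like the one that blog describes can be sketched as below. This is a sketch under assumptions: it presumes each listing entry exposes path, name, size, and a boolean isDir field (check the exact attribute names of mssparkutils.fs.ls() entries in your runtime), and it uses a mock directory tree so it runs outside Synapse; in Synapse you would pass mssparkutils.fs.ls as the listing function.

```python
from collections import namedtuple

# Hypothetical entry type standing in for mssparkutils FileInfo
Entry = namedtuple("Entry", ["path", "name", "size", "isDir"])

def list_files_recursive(path, ls):
    """Flatten a directory tree into a list of file entries.
    `ls` is any function that lists one directory (e.g. mssparkutils.fs.ls)."""
    files = []
    for entry in ls(path):
        if entry.isDir:
            files.extend(list_files_recursive(entry.path, ls))
        else:
            files.append(entry)
    return files

# Mock tree standing in for a container, so the sketch runs outside Synapse
tree = {
    "/": [Entry("/config.json", "config.json", 26771, False),
          Entry("/d365crm", "d365crm", 0, True)],
    "/d365crm": [Entry("/d365crm/Account.xml", "Account.xml", 3041, False)],
}
files = list_files_recursive("/", tree.get)
print([f.name for f in files])  # ['config.json', 'Account.xml']
```

The flattened list can then be converted to a DataFrame exactly as in the dictionary example above.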

Upvotes: 1
