Reputation: 167
In a Synapse workspace, I'm using this function to list all the files contained in the config container:
mssparkutils.fs.ls("abfss://[email protected]/")
and I got this list:
[FileInfo(path=abfss://[email protected]/config.json, name=config.json, size=26771),
FileInfo(path=abfss://[email protected]/d365crm/Account.xml, name=Account.xml, size=3041),
FileInfo(path=abfss://[email protected]/d365crm/Contact.xml, name=Contact.xml, size=1985),
FileInfo(path=abfss://[email protected]/d365crm/Contract.xml, name=Contract.xml, size=1987)]
I want to store this data in a PySpark DataFrame.
I tried the following code, but it returns all null values:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
list_df = [item for sublist in list_for_dataframe for item in sublist]
schema = StructType([
    StructField("path", StringType(), True),
    StructField("name", StringType(), True),
    StructField("size", LongType(), True)
])
df = spark.createDataFrame(list_df, schema=schema)
# Show the DataFrame
df.show()
My desired output is a DataFrame with the path, name, and size of each file.
Can anyone please help me in achieving this?
Thank you!
Upvotes: 0
Views: 2788
Reputation: 11399
If you can get the recursive file list as shown above, convert it into a list of dictionaries and create the DataFrame from that list.
As a sample, I used a wasbs path; this is my list from mssparkutils.fs.ls().
Convert it into a list of dictionaries as below, and you can see that the DataFrame is created.
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Build one dictionary per FileInfo entry returned by mssparkutils.fs.ls()
data = []
for i in files_list:
    d = {}
    d["path"] = i.path
    d["name"] = i.name
    d["size"] = i.size
    data.append(d)
print("List of dictionaries : ", data)

schema = StructType([
    StructField("path", StringType(), True),
    StructField("name", StringType(), True),
    StructField("size", LongType(), True)
])
df = spark.createDataFrame(data, schema=schema)
print("Dataframe is: ")
display(df)
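The same conversion can probably be written more compactly by building plain tuples instead of dictionaries. This is only a minimal sketch, assuming each entry exposes path, name, and size as in the listing from the question. The FileInfo namedtuple, the to_rows helper, and the placeholder paths are hypothetical stand-ins so the snippet runs outside Synapse.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by mssparkutils.fs.ls();
# only the three attributes used in this thread are assumed.
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])

# Placeholder paths, not a real storage account.
files_list = [
    FileInfo("abfss://<container>@<account>.dfs.core.windows.net/config.json",
             "config.json", 26771),
    FileInfo("abfss://<container>@<account>.dfs.core.windows.net/d365crm/Account.xml",
             "Account.xml", 3041),
]

def to_rows(file_infos):
    """Turn FileInfo-like objects into (path, name, size) tuples."""
    return [(f.path, f.name, int(f.size)) for f in file_infos]

rows = to_rows(files_list)

# In a Synapse notebook, where `spark` is already defined, the rows can be
# passed together with a DDL-style schema string:
# df = spark.createDataFrame(rows, schema="path string, name string, size long")
# df.show(truncate=False)
```

Plain tuples (or dictionaries) give Spark values it can map onto the schema by position or by key, which is the point of converting the FileInfo objects first.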
You can go through this blog by @Raki Rahman to learn about recursive file listing and creating a DataFrame from it using pandas.
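For reference, a recursive listing could look roughly like the sketch below. It is not the code from the blog. It assumes each entry has an isDir attribute next to path, name, and size, and it takes the listing function as a parameter so it can be tried without a storage account. The list_files_recursive helper, the FileInfo namedtuple, and the fake_tree data are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by mssparkutils.fs.ls();
# the isDir attribute is an assumption of this sketch.
FileInfo = namedtuple("FileInfo", ["path", "name", "size", "isDir"])

def list_files_recursive(ls, root):
    """Walk the tree depth-first; `ls` is any callable returning FileInfo-like objects."""
    found = []
    stack = [root]
    while stack:
        current = stack.pop()
        for entry in ls(current):
            if entry.isDir:
                stack.append(entry.path)  # descend into the folder later
            else:
                found.append(entry)       # keep files only
    return found

# Fake in-memory tree so the sketch runs anywhere.
fake_tree = {
    "root/": [
        FileInfo("root/config.json", "config.json", 26771, False),
        FileInfo("root/d365crm/", "d365crm", 0, True),
    ],
    "root/d365crm/": [
        FileInfo("root/d365crm/Account.xml", "Account.xml", 3041, False),
    ],
}

files = list_files_recursive(lambda p: fake_tree[p], "root/")
names = sorted(f.name for f in files)

# In Synapse, the same helper would be called with the real listing function:
# files_list = list_files_recursive(mssparkutils.fs.ls, root_path)
```

The resulting flat list can then be converted to dictionaries or tuples exactly as in the answer above.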
Upvotes: 1