Reputation: 193
How to create a Spark DataFrame from a nested dictionary? I'm new to Spark, and I do not want to go through a pandas DataFrame.
My dictionary looks like:
{'[email protected]': {'Date': datetime.date(2019, 10, 21), 'idle_time': datetime.datetime(2019, 10, 21, 1, 50)},
 '[email protected]': {'Date': datetime.date(2019, 10, 21), 'idle_time': datetime.datetime(2019, 10, 21, 1, 35)},
 '[email protected]': {'Date': datetime.date(2019, 10, 21), 'idle_time': datetime.datetime(2019, 10, 21, 1, 55)}
}
I want to convert this dict to a Spark DataFrame using PySpark.
My expected output:
                               Date            idle_time
user_name
[email protected]    2019-10-21  2019-10-21 01:50:00
[email protected]    2019-10-21  2019-10-21 01:35:00
[email protected]              2019-10-21  2019-10-21 01:55:00
Upvotes: 2
Views: 6906
Reputation: 3344
Convert the dictionary to a list of tuples; each tuple will then become a row in the Spark DataFrame:
# assuming the dictionary from the question is bound to `data`
rows = []
for key, value in data.items():
    row = (key, value['Date'], value['idle_time'])
    rows.append(row)
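If you prefer, the same transformation can be written as a single list comprehension (a minimal equivalent, assuming the same `data` dictionary):
rows = [(key, value['Date'], value['idle_time']) for key, value in data.items()]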
Define a schema for your data:
from pyspark.sql.types import *
sch = StructType([
    StructField('user_name', StringType()),
    StructField('date', DateType()),
    StructField('idle_time', TimestampType())
])
Create the Spark DataFrame:
df = spark.createDataFrame(rows, sch)
df.show()
+--------------------+----------+-------------------+
| user_name| date| idle_time|
+--------------------+----------+-------------------+
|prathameshsalap@g...|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143@g...|2019-10-21|2019-10-21 01:35:00|
| [email protected]|2019-10-21|2019-10-21 01:55:00|
+--------------------+----------+-------------------+
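Note that show() truncates cell values to 20 characters by default, which is why the addresses appear cut off above; passing truncate=False prints them in full:
df.show(truncate=False)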
Upvotes: 2
Reputation: 541
You need to restructure your dictionary into Row objects so that Spark can properly infer the schema.
import datetime
from pyspark.sql import Row
data_dict = {
    '[email protected]': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 50)
    },
    '[email protected]': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 35)
    },
    '[email protected]': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 55)
    }
}
# Build one Row per user; merging the key in as 'user_name' lets Spark
# infer the schema from the Row fields. Note: in Spark < 3.0, Row fields
# created from keyword arguments are sorted alphabetically, hence the
# select() to restore the intended column order.
data_as_rows = [Row(**{'user_name': k, **v}) for k, v in data_dict.items()]
data_df = spark.createDataFrame(data_as_rows).select('user_name', 'Date', 'idle_time')
data_df.show(truncate=False)
>>>
+-------------------------+----------+-------------------+
|user_name |Date |idle_time |
+-------------------------+----------+-------------------+
|[email protected]|2019-10-21|2019-10-21 01:50:00|
|[email protected]|2019-10-21|2019-10-21 01:35:00|
|[email protected] |2019-10-21|2019-10-21 01:55:00|
+-------------------------+----------+-------------------+
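If you want to double-check what Spark inferred from the Row objects, printSchema() should report string, date, and timestamp columns (a quick sanity check, not part of the original snippet):
data_df.printSchema()
>>>
root
 |-- user_name: string (nullable = true)
 |-- Date: date (nullable = true)
 |-- idle_time: timestamp (nullable = true)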
Note: if you already have the schema prepared and don't need to infer it, you can just supply the schema to the createDataFrame function:
import pyspark.sql.types as T
schema = T.StructType([
    T.StructField('user_name', T.StringType(), False),
    T.StructField('Date', T.DateType(), False),
    T.StructField('idle_time', T.TimestampType(), False)
])
data_as_tuples = [(k, v['Date'], v['idle_time']) for k, v in data_dict.items()]
data_df = spark.createDataFrame(data_as_tuples, schema=schema)
data_df.show(truncate=False)
>>>
+-------------------------+----------+-------------------+
|user_name |Date |idle_time |
+-------------------------+----------+-------------------+
|[email protected]|2019-10-21|2019-10-21 01:50:00|
|[email protected]|2019-10-21|2019-10-21 01:35:00|
|[email protected] |2019-10-21|2019-10-21 01:55:00|
+-------------------------+----------+-------------------+
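As a minor variant, createDataFrame also accepts the schema as a DDL-formatted string (assuming a reasonably recent Spark, 2.3+), which avoids the StructType boilerplate; this sketch reuses the same data_as_tuples list:
data_df = spark.createDataFrame(
    data_as_tuples,
    schema='user_name string, Date date, idle_time timestamp'
)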
Upvotes: 4