Reputation: 2083
I am trying to convert a dictionary:
data_dict = {'t1': '1', 't2': '2', 't3': '3'}
into a dataframe:
key | value
----+------
t1  | 1
t2  | 2
t3  | 3
To do that, I tried:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("key", StringType(), True), StructField("value", StringType(), True)])
ddf = spark.createDataFrame(data_dict, schema)
But I got the below error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/2.4.5/libexec/python/pyspark/sql/session.py", line 748, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/usr/local/Cellar/apache-spark/2.4.5/libexec/python/pyspark/sql/session.py", line 413, in _createFromLocal
data = list(data)
File "/usr/local/Cellar/apache-spark/2.4.5/libexec/python/pyspark/sql/session.py", line 730, in prepare
verify_func(obj)
File "/usr/local/Cellar/apache-spark/2.4.5/libexec/python/pyspark/sql/types.py", line 1389, in verify
verify_value(obj)
File "/usr/local/Cellar/apache-spark/2.4.5/libexec/python/pyspark/sql/types.py", line 1377, in verify_struct
% (obj, type(obj))))
TypeError: StructType can not accept object 't1' in type <class 'str'>
So I tried this without specifying any schema but just the column datatypes:
ddf = spark.createDataFrame(data_dict, StringType())
and
ddf = spark.createDataFrame(data_dict, StringType(), StringType())
But both result in a one-column dataframe containing only the dictionary's keys:
+-----+
|value|
+-----+
|   t1|
|   t2|
|   t3|
+-----+
Could anyone let me know how to convert a dictionary into a Spark dataframe in PySpark?
Upvotes: 9
Views: 12885
Reputation: 1
Considering data as a dict whose values are the column lists:

import numpy as np

df = spark.createDataFrame(
    data=np.array(list(data.values())).T.tolist(),
    schema=list(data.keys())
)
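A quick sketch of what the transpose step produces (using a hypothetical data dict of column lists; note that numpy promotes mixed int/str columns to a single string dtype, so every cell comes back as a str):

```python
import numpy as np

data = {"col1": [1, 2, 3], "col2": ["a", "b", "c"]}
rows = np.array(list(data.values())).T.tolist()
# numpy coerces the mixed columns to strings:
# rows == [['1', 'a'], ['2', 'b'], ['3', 'c']]
# spark.createDataFrame(rows, schema=list(data.keys()))  # needs an active SparkSession
```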
Upvotes: 0
Reputation: 1
You can make a list of dictionaries, like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
{"deptId": 1, "age": 40},
{"deptId": 2, "age": 50},
])
df.show()
Upvotes: 0
Reputation: 754
I just want to add: if you have a dictionary that maps each column to a list of values,
for instance:
{
"col1" : [1,2,3],
"col2" : ["a", "b", "c"]
}
A possible solution is:
columns = list(raw_data.keys())
data = [[*vals] for vals in zip(*raw_data.values())]
df = spark.createDataFrame(data, columns)
But I'm new to PySpark; I guess there is an even better way to do this?
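One possibly simpler route for a column-oriented dict is to go through a pandas DataFrame, which Spark can ingest directly (a sketch, assuming pandas is installed and a SparkSession named spark exists):

```python
import pandas as pd

raw_data = {"col1": [1, 2, 3], "col2": ["a", "b", "c"]}
pdf = pd.DataFrame(raw_data)  # columns come straight from the dict keys
# spark.createDataFrame(pdf)  # Spark infers the schema from the pandas dtypes
```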
Upvotes: 0
Reputation: 45339
You can use data_dict.items()
to list key/value pairs:
spark.createDataFrame(data_dict.items()).show()
Which prints
+---+---+
| _1| _2|
+---+---+
| t1| 1|
| t2| 2|
| t3| 3|
+---+---+
Of course, you can specify your schema:
from pyspark.sql.types import StructType, StructField, StringType

spark.createDataFrame(data_dict.items(),
    schema=StructType(fields=[
        StructField("key", StringType()),
        StructField("value", StringType())])).show()
Resulting in
+---+-----+
|key|value|
+---+-----+
| t1| 1|
| t2| 2|
| t3| 3|
+---+-----+
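As a side note, .items() simply turns the dict into (key, value) tuples, each of which becomes one row. A minimal sketch (the DDL-string schema shown in the comment is an alternative to StructType that createDataFrame accepts; requires Spark 2.3+):

```python
data_dict = {'t1': '1', 't2': '2', 't3': '3'}
pairs = list(data_dict.items())
# pairs == [('t1', '1'), ('t2', '2'), ('t3', '3')] -- one tuple per row
# spark.createDataFrame(pairs, schema="key string, value string")  # DDL-string schema
```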
Upvotes: 11