Reputation: 563
I know this is a rather basic question, but for the life of me I can't get it to work and I am running out of time, so: I have a dict that looks like this
data_dict = {'timestamp': '2019-05-01T06:00:00-04:00', 'data': [0.37948282157787916, 1.5890471705541012, 2.1883813840381885], '_id': '62377385587e549976adfda0'}
How can I create a dataframe from it? I tried:
schema = StructType([
    StructField('timestamp', TimestampType(), True),
    StructField('data', ArrayType(DecimalType()), True),
    StructField('_id', StringType(), True)
])
df = spark.createDataFrame(data=data_dict, schema=schema)
This gives me the error:
TypeError: StructType can not accept object 'timestamp' in type <class 'str'>
But even when I shrink the dict and remove the timestamp from both the dict and the schema, I get a similar error:
TypeError: StructType can not accept object 'data' in type <class 'str'>
Any help greatly appreciated, thanks a lot in advance!
Edit: I just figured out that by simply putting [] around the dict I get it to work. However, if anyone has a less ugly solution, I'll buy it.
Upvotes: 0
Views: 190
Reputation: 144
You can cast the columns to the desired types once the data is in the DataFrame, and then, if needed, explode the data column to spread the array values across rows.
data_dict = {'timestamp': '2019-05-01T06:00:00-04:00', 'data': [0.37948282157787916, 1.5890471705541012, 2.1883813840381885], '_id': '62377385587e549976adfda0'}
from pyspark.sql.functions import col, explode
from pyspark.sql.types import TimestampType

df = spark.createDataFrame([data_dict]).select(
    '_id',
    explode('data').alias('data'),
    col('timestamp').cast(TimestampType())
)
| _id | data | timestamp |
|---|---|---|
| 62377385587e549976adfda0 | 0.37948282157787916 | 2019-05-01T06:00:00-04:00 |
| 62377385587e549976adfda0 | 1.5890471705541012 | 2019-05-01T06:00:00-04:00 |
| 62377385587e549976adfda0 | 2.1883813840381885 | 2019-05-01T06:00:00-04:00 |
Upvotes: 1