Reputation: 13
I am trying to manually create a dummy PySpark DataFrame.
I did the following:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [('{"Time":"2020-08-01T08:14:20.650Z","version":null}')
]
schema = StructType([ \
StructField("raw_json",StringType(),True)
])
df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.show(truncate=False)
But I got the error:
TypeError: StructType can not accept object '[{"Time:"2020-08-01T08:14:20.650Z","version":null}]' in type <class 'str'>
How can I put a JSON string into a PySpark DataFrame as a value?
My ideal result is:
+--------------------------------------------------+
|value                                             |
+--------------------------------------------------+
|{"Time":"2020-08-01T08:14:20.650Z","version":null}|
+--------------------------------------------------+
Upvotes: 1
Views: 1187
Reputation: 2696
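Try this. Note that json.loads parses each string into a dict, so the resulting column holds a map rather than the raw JSON text; use lambda x: [x] instead if you want to keep the string as-is.
import json
# sc is the SparkContext (spark.sparkContext); data2 is the list of JSON strings from the question
df = sc.parallelize(data2).map(lambda x: [json.loads(x)]).toDF(schema=['raw_json'])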
Upvotes: 0
Reputation: 42352
It also works if you specify data2 as a list of tuples. Parentheses around a single value only group it, so ('...') is still a plain string; adding a trailing comma inside the parentheses makes it a one-element tuple.
from pyspark.sql.types import StructType, StructField, StringType
# Note the trailing comma inside the parentheses
data2 = [('{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}',)]
schema = StructType([
StructField("raw_json",StringType(),True)
])
df = spark.createDataFrame(data=data2,schema=schema)
df.show(truncate=False)
+------------------------------------------------------------------+
|raw_json                                                          |
+------------------------------------------------------------------+
|{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}|
+------------------------------------------------------------------+
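To see why the trailing comma matters, here is a minimal plain-Python check (no Spark needed; the variable names are only for illustration). Parentheses around a single value are just grouping, so each element stays a str, which is what StructType rejects as a row:

```python
json_str = '{"Time":"2020-08-01T08:14:20.650Z","version":null}'

without_comma = [(json_str)]   # parentheses only group: a list of str
with_comma = [(json_str,)]     # trailing comma: a list of 1-tuples
with_list = [[json_str]]       # a list also works as a row

print(type(without_comma[0]).__name__)  # str
print(type(with_comma[0]).__name__)     # tuple
print(type(with_list[0]).__name__)      # list
```

Only the last two forms give createDataFrame one row with one column.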
Upvotes: 0
Reputation: 3419
The error is caused by your parentheses: without a trailing comma, ('...') is just a string, not a row. data2 should be a list of lists, so replace the inner parentheses with square brackets:
from pyspark.sql.types import StructType, StructField, StringType
data2 = [['{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}']]
schema = StructType([StructField("raw_json",StringType(),True)])
df = spark.createDataFrame(data=data2,schema=schema)
df.show(truncate=False)
+------------------------------------------------------------------+
|raw_json                                                          |
+------------------------------------------------------------------+
|{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}|
+------------------------------------------------------------------+
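If you have many JSON strings, a small helper can wrap each one as a single-column row before calling createDataFrame. This is only a sketch: to_rows is a hypothetical name, not part of PySpark, and the json.loads call is just an optional sanity check that each string parses.

```python
import json

def to_rows(json_strings):
    """Wrap each JSON string in a 1-tuple so every row has exactly one column."""
    rows = []
    for s in json_strings:
        json.loads(s)      # raises ValueError if s is not valid JSON
        rows.append((s,))  # keep the raw string, not the parsed object
    return rows

data2 = to_rows(['{"Time":"2020-08-01T08:14:20.650Z","version":null}'])
print(data2)
```

The result can then be passed as data=data2 to spark.createDataFrame together with the single-column schema shown above.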
Upvotes: 1