sdwww

Reputation: 13

Create a PySpark DataFrame with JSON string values and a schema

I am trying to manually create a dummy PySpark DataFrame.

I did the following:

from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [('{"Time":"2020-08-01T08:14:20.650Z","version":null}')
            ]

schema = StructType([ \
    StructField("raw_json",StringType(),True)
  ])

df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.show(truncate=False)

but I got the error:

TypeError: StructType can not accept object '[{"Time:"2020-08-01T08:14:20.650Z","version":null}]' in type <class 'str'>

How can I put a JSON string into a PySpark DataFrame as a value?

My ideal result is:

+--------------------------------------------------+
|value                                             |
+--------------------------------------------------+
|{"Time":"2020-08-01T08:14:20.650Z","version":null}|
+--------------------------------------------------+

Upvotes: 1

Views: 1187

Answers (3)

ags29

Reputation: 2696

Try this:

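import json

# Assumes `sc` is the active SparkContext. json.loads parses each string into a
# dict, so the column holds the parsed object rather than the raw JSON string.
df = sc.parallelize(data2).map(lambda x: [json.loads(x)]).toDF(schema=['raw_json'])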
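
To see what `json.loads` does to each value before it reaches Spark, here is a small plain-Python sketch (no Spark needed). The variable names `raw` and `parsed` are only for illustration:

```python
import json

# The raw value from the question: a JSON document stored as a Python str.
raw = '{"Time":"2020-08-01T08:14:20.650Z","version":null}'

# json.loads turns the string into a Python dict; JSON null becomes None.
parsed = json.loads(raw)

print(type(raw).__name__)     # str
print(type(parsed).__name__)  # dict
print(parsed["version"])      # None
```

If the goal is a column that holds the JSON text itself, the string probably should be passed through unparsed.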

Upvotes: 0

mck

Reputation: 42352

It also works if you specify data2 as a list of tuples. Parentheses around a single value only group the expression, so ('...') is still a plain string; adding a trailing comma inside the parentheses makes it a one-element tuple.

from pyspark.sql.types import StructType, StructField, StringType

# Note the trailing comma inside the parentheses
data2 = [('{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}',)]

schema = StructType([
    StructField("raw_json",StringType(),True)
])

df = spark.createDataFrame(data=data2,schema=schema)
df.show(truncate=False)
+------------------------------------------------------------------+
|raw_json                                                          |
+------------------------------------------------------------------+
|{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}|
+------------------------------------------------------------------+
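
The trailing-comma point can be checked in plain Python, without Spark. This is just a minimal sketch; the names `without_comma` and `with_comma` are made up for the illustration:

```python
json_str = '{"Time":"2020-08-01T08:14:20.650Z","version":null}'

# Parentheses alone only group an expression; the comma is what makes a tuple.
without_comma = (json_str)   # still a plain str
with_comma = (json_str,)     # a one-element tuple

print(type(without_comma).__name__)  # str
print(type(with_comma).__name__)     # tuple
print(len(with_comma))               # 1
```

This is why the list in the question is a list of strings rather than a list of rows.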

Upvotes: 0

Cena

Reputation: 3419

The error is caused by your parentheses. data2 should be a list of lists, so replace the inner parentheses with square brackets:

from pyspark.sql.types import StructType, StructField, StringType

# Each inner list is one row; here every row has a single column.
data2 = [['{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}']]

schema = StructType([StructField("raw_json",StringType(),True)])
df = spark.createDataFrame(data=data2,schema=schema)

df.show(truncate=False)
+------------------------------------------------------------------+
|raw_json                                                          |
+------------------------------------------------------------------+
|{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}|
+------------------------------------------------------------------+
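
If you have many JSON strings, wrapping each one by hand gets tedious. Below is a rough sketch of a helper that builds the list of one-column rows; `to_single_column_rows` is a hypothetical name, not part of PySpark, and the type check is just a guard I added:

```python
def to_single_column_rows(json_strings):
    """Wrap each string in a one-element list so it forms one row with one column."""
    rows = []
    for s in json_strings:
        if not isinstance(s, str):
            raise TypeError(f"expected str, got {type(s).__name__}")
        rows.append([s])
    return rows

rows = to_single_column_rows([
    '{"Time":"2020-08-01T08:14:20.650Z","version":null}',
])
print(rows)
# [['{"Time":"2020-08-01T08:14:20.650Z","version":null}']]
```

The resulting `rows` should be usable as the `data` argument in the `spark.createDataFrame(data=..., schema=schema)` call above, in place of the hand-written `data2`.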

Upvotes: 1
