Saurabh Deshpande

Reputation: 25

Failing to put data into desired Schema in pyspark

I have a pyspark dataframe which looks as follows:

>>> df.show(1, False)                                                           
{"data":{"probability":0.2345,"customerId":1234567,"region":"BR"},"uploadedDate":1542548806295} 

Above is the output when I don't pass any schema as input.

I am trying the following script to load the data with the schema mentioned:

from pyspark.sql.types import StructType, StructField, LongType, StringType

SCHEMA = StructType([StructField('probabilityMale', LongType(), True),
                     StructField('customerId', LongType(), True),
                     StructField('region', StringType(), True),
                     StructField('uploadedDate', StringType(), True)])

df = spark.read.format('csv').\
     option('header','false').\
     option('delimiter','\t').\
     schema(SCHEMA).\
     load(path)

But this doesn't give each data point in a separate column. I also tried with inferSchema:

df = spark.read.format('csv').\
     option('header','false').\
     option('delimiter','\t').\
     option("inferSchema", "true").\
     load(path)

But I get the same output as mentioned earlier.

How can I specify the schema and have the data in separate columns?

Upvotes: 1

Views: 67

Answers (1)

mck

Reputation: 42422

You have a JSON input, which you should read with the JSON reader, not the CSV reader:

df = spark.read.json(path)

And to get the columns separately, you can expand the struct data:

df2 = df.select('data.*', 'uploadedDate')
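Conceptually, the JSON reader parses each line into a record with a nested `data` struct, and `select('data.*', 'uploadedDate')` promotes the nested fields to top-level columns. A plain-Python sketch of that flattening, using the sample line from the question (no Spark required):

```python
import json

# One line of the input file: a JSON object with a nested "data" struct.
line = '{"data":{"probability":0.2345,"customerId":1234567,"region":"BR"},"uploadedDate":1542548806295}'

record = json.loads(line)

# Equivalent of df.select('data.*', 'uploadedDate'):
# promote the fields inside "data" to the top level and keep uploadedDate.
flat = {**record["data"], "uploadedDate": record["uploadedDate"]}
print(flat)
# {'probability': 0.2345, 'customerId': 1234567, 'region': 'BR', 'uploadedDate': 1542548806295}
```

In Spark the same expansion happens per row, so `df2` ends up with the columns `probability`, `customerId`, `region`, and `uploadedDate`.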

Upvotes: 1
