Reputation: 25
I have a PySpark DataFrame that looks as follows:
>>> df.show(1, False)
{"data":{"probability":0.2345,"customerId":1234567,"region":"BR"},"uploadedDate":1542548806295}
Above is the output when I don't pass any schema as input.
I am trying the following script to load the data with a schema specified:
SCHEMA = StructType([
    StructField('probabilityMale', LongType(), True),
    StructField('customerId', LongType(), True),
    StructField('region', StringType(), True),
    StructField('uploadedDate', StringType(), True),
])
df = spark.read.format('csv').\
    option('header', 'false').\
    option('delimiter', '\t').\
    schema(SCHEMA).\
    load(path)
But this doesn't put each data point in a separate column. I also tried with inferSchema:
df = spark.read.format('csv').\
    option('header', 'false').\
    option('delimiter', '\t').\
    option('inferSchema', 'true').\
    load(path)
But I get the same output as shown earlier.
How can I specify a schema and get each field in its own column?
Upvotes: 1
Views: 67
Reputation: 42422
You have a JSON input, which you should read with the JSON reader, not the CSV reader:
df = spark.read.json(path)
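To see why the CSV reader returns a single column, here is a rough plain-Python illustration (not what Spark does internally, just the same idea): the sample line from the question contains no tab character, so a tab delimiter has nothing to split on, whereas a JSON parser sees the nested fields.

```python
import json

# The sample record from the question, as one line of the input file.
line = ('{"data":{"probability":0.2345,"customerId":1234567,"region":"BR"},'
        '"uploadedDate":1542548806295}')

# Splitting on the tab delimiter finds nothing to split on,
# so the whole line ends up in a single field.
fields = line.split('\t')
print(len(fields))            # 1

# Parsing the same line as JSON exposes the nested structure.
record = json.loads(line)
print(sorted(record))         # ['data', 'uploadedDate']
print(sorted(record['data'])) # ['customerId', 'probability', 'region']
```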
To get the fields as separate columns, expand the data struct:
df2 = df.select('data.*', 'uploadedDate')
Upvotes: 1