Reputation: 3931
I'm trying to create a manual schema for a DataFrame. The data I am passing in is an RDD created from JSON. Here is my initial data:
json2 = sc.parallelize(['{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'])
Here is how the schema is specified:
schema = StructType(fields=[
    StructField(
        name='name',
        dataType=StringType(),
        nullable=True
    ),
    StructField(
        name='pandas',
        dataType=ArrayType(
            StructType(
                fields=[
                    StructField(
                        name='id',
                        dataType=StringType(),
                        nullable=False
                    ),
                    StructField(
                        name='zip',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='pt',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='happy',
                        dataType=BooleanType(),
                        nullable=False
                    ),
                    StructField(
                        name='attributes',
                        dataType=ArrayType(
                            elementType=DoubleType(),
                            containsNull=False
                        ),
                        nullable=True
                    )
                ]
            ),
            containsNull=True
        ),
        nullable=True
    )
])
When I use sqlContext.createDataFrame(json2, schema) and then try to call show() on the resulting DataFrame, I receive the following error:
ValueError: Unexpected tuple '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}' with StructType
Upvotes: 4
Views: 10601
Reputation: 330093
First of all, json2 is just an RDD[String]. Spark has no special knowledge about the serialization format used to encode the data. Moreover, createDataFrame expects an RDD of Row or some other product type (such as a tuple), and that is clearly not the case here.
In Scala you could use sqlContext.read.schema(schema).json(rdd) with RDD[String], but there are two problems:
even if it was, the schema you've created is simply invalid:
- pandas is a struct, not an array
- pandas.happy is a string, not a boolean
- pandas.attributes is a string, not an array
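To see these mismatches for yourself, you can decode the sample record with the standard json module and look at the Python types that come back. This is only a quick sketch that runs without Spark; the names line and observed are mine, not part of the question.

```python
import json

# The exact record from the question.
line = ('{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", '
        '"pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}')

record = json.loads(line)
pandas = record["pandas"]

# Python type of each field the schema disagrees with.
observed = {
    "pandas": type(pandas).__name__,
    "pandas.happy": type(pandas["happy"]).__name__,
    "pandas.attributes": type(pandas["attributes"]).__name__,
}
print(observed)
```

A dict corresponds to a JSON object (a struct in Spark terms), and both happy and attributes come back as plain strings because they are quoted in the input.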
The schema is used only to avoid type inference, not for type casting or any other transformations. If you want to transform the data, you'll have to parse it first:
def parse(s: str) -> Row:
    # convert the raw JSON string into a Row with correctly typed fields
    return ...

rdd.map(parse).toDF(schema)
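As a rough illustration of what such a parse function might do for the record in the question, here is a sketch that uses only the standard library. It is an assumption on my part, not the only way to write it: it returns nested tuples instead of Row objects, with the inner fields ordered as attributes, happy, id, pt, zip, and it treats the string "True" as the only truthy value for happy.

```python
import json

def parse(s):
    """Decode one JSON line and convert the string-encoded fields to real types."""
    record = json.loads(s)
    p = record["pandas"]
    pandas = (
        json.loads(p["attributes"]),  # "[0.4, 0.5]" -> [0.4, 0.5]
        p["happy"] == "True",         # "True" -> True, anything else -> False
        p["id"],
        p["pt"],
        p["zip"],
    )
    return (record["name"], pandas)

line = ('{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", '
        '"pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}')
parsed = parse(line)
print(parsed)
```

With Spark available, something like sqlContext.createDataFrame(json2.map(parse), schema) should then work, provided the schema declares pandas as a struct whose fields are in the same order as the tuple. That step is not exercised in the sketch above.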
Assuming that you have JSON like this (with fixed types):
{"name": "mission", "pandas": {"attributes": [0.4, 0.5], "pt": "giant", "id": "1", "zip": "94110", "happy": true}}
the correct schema would look as follows:
StructType([
    StructField("name", StringType(), True),
    StructField("pandas", StructType([
        StructField("attributes", ArrayType(DoubleType(), True), True),
        StructField("happy", BooleanType(), True),
        StructField("id", StringType(), True),
        StructField("pt", StringType(), True),
        StructField("zip", StringType(), True)]),
    True)])
Upvotes: 5