flybonzai

Reputation: 3931

DataFrame - ValueError: Unexpected tuple with StructType

I'm trying to create a manual schema for a dataframe. The data I am passing in is an RDD created from json. Here is my initial data:

json2 = sc.parallelize(['{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'])

Then here is how schema is specified:

schema = StructType(fields=[
    StructField(
        name='name',
        dataType=StringType(),
        nullable=True
    ),
    StructField(
        name='pandas',
        dataType=ArrayType(
            StructType(
                fields=[
                    StructField(
                        name='id',
                        dataType=StringType(),
                        nullable=False
                    ),
                    StructField(
                        name='zip',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='pt',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='happy',
                        dataType=BooleanType(),
                        nullable=False
                    ),
                    StructField(
                        name='attributes',
                        dataType=ArrayType(
                            elementType=DoubleType(),
                            containsNull=False
                        ),
                        nullable=True
                    )
                ]
            ),
            containsNull=True
        ),
        nullable=True
    )
])

When I use sqlContext.createDataFrame(json2, schema) and then try to do a show() on the resulting dataframe I receive the following error:

ValueError: Unexpected tuple '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}' with StructType

Upvotes: 4

Views: 10601

Answers (1)

zero323

Reputation: 330093

First of all, json2 is just an RDD[String]. Spark has no special knowledge about the serialization format used to encode the data. Moreover, it expects an RDD of Row, or some product, and that is clearly not the case here.

In Scala you could use

sqlContext.read.schema(schema).json(rdd) 

with RDD[String] but there are two problems:

  • this approach is not directly accessible in PySpark
  • even if it were, the schema you've created is simply invalid:

    • pandas is a struct, not an array
    • pandas.happy is a string, not a boolean
    • pandas.attributes is a string, not an array

A schema is used only to avoid type inference, not for type casting or any other transformations. If you want to transform the data you'll have to parse it first:

def parse(s: str) -> Row:
    return ...

rdd.map(parse).toDF(schema)
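For the sample record above, such a parser could be sketched as follows. It decodes the outer JSON with Python's json module and coerces the stringly-typed fields (attributes, happy) into the types the target schema expects; the `== "True"` check for the boolean is an assumption about how the booleans were serialized, and the tuple layout mirrors the field order of the schema given below (attributes, happy, id, pt, zip):

```python
import json

def parse(s):
    # Hypothetical parser: decode the JSON string and cast the
    # stringly-typed fields into the types the schema expects.
    d = json.loads(s)
    p = d["pandas"]
    return (
        d["name"],
        (
            # "attributes" arrives as the string "[0.4, 0.5]",
            # so it needs a second json.loads pass
            [float(x) for x in json.loads(p["attributes"])],
            # "happy" arrives as the string "True"
            p["happy"] == "True",
            p["id"],
            p["pt"],
            p["zip"],
        ),
    )
```

Tuples are an acceptable stand-in for Row objects when paired with an explicit StructType, so `rdd.map(parse)` can then be passed to `sqlContext.createDataFrame` together with the schema.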

Assuming that you have JSON like this (with fixed types):

{"name": "mission", "pandas": {"attributes": [0.4, 0.5], "pt": "giant", "id": "1", "zip": "94110", "happy": true}} 

the correct schema would look as follows:

StructType([
    StructField("name", StringType(), True),
    StructField("pandas", StructType([
        StructField("attributes", ArrayType(DoubleType(), True), True),
        StructField("happy", BooleanType(), True),
        StructField("id", StringType(), True),
        StructField("pt", StringType(), True),
        StructField("zip", StringType(), True)
    ]), True)
])

Upvotes: 5

Related Questions