Reputation: 563
I am trying to create a dataframe out of a nested JSON structure, but I am encountering a problem that I don't understand. I have exploded an array-of-dicts structure in the JSON and now I am trying to access these dicts and create columns with the values in there. This is what the dicts look like:
The values at index 1 (subject, glocations etc.) go under the key "name" according to the schema:
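For additional context, the explode step was roughly the following (variable and column names here are simplified assumptions, e.g. the array column "keywords"):
from pyspark.sql import functions as F
# one row per keyword dict after the explode
dataframe = responseDf.withColumn("keywords_exp", F.explode("keywords"))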
However, when I try:
dataframe = dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
it throws error:
PySpark: TypeError: col should be Column
There is no such problem with any of the other keys in the dict, e.g. "value".
I really do not understand the problem, do I have to assume that there are inconsistencies in the data? If yes, can you recommend a way to check for or even dodge them?
Edit: Khalid had a good idea to pre-define the schema. I tried to do so by storing one of the JSON files as a kind of default file. From that file, I wanted to extract the schema as follows:
import json
from pyspark.sql.types import StructType

schemapath = 'default_schema.json'
with open(schemapath) as f:
    d = json.load(f)
schemaNew = StructType.fromJson(d)
responseDf = spark.read.schema(schemaNew).json("apiResponse.json", multiLine=True)
however, line
schemaNew = StructType.fromJson(d)
throws following error:
KeyError: 'fields'
No idea where this 'fields' is coming from...
Upvotes: 3
Views: 30176
Reputation: 24478
Errors in Spark tell the truth.
dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
TypeError: col should be Column
The DataFrame.withColumn documentation tells you what its input parameters are called and what their data types are:
- Parameters:
  - colName: str
    string, name of the new column.
  - col: Column
    a Column expression for the new column.
So, col is the parameter's name and Column is its type. Column is the data type which withColumn expects to get as the parameter named col. What did it actually receive? It received dataframe.keywords_exp.name. But what data type is that?
print(type(dataframe.keywords_exp.name))
# <class 'method'>
As can be seen, it's not of the expected type Column... (the dot syntax fails here because name collides with the existing Column.name method, so attribute access returns that bound method rather than a reference to the struct field).
To get a Column from a Struct's field, you must use a different syntax.
Note: data types in the dataframe are not what you think they are. You don't have dicts anymore. Instead, you have a Struct type column. The keys from the old dictionaries are now field names of the Struct type column.
To access struct fields, you should be using any of the following options:
df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df = dataframe.withColumn("keywords_name", dataframe.keywords_exp['name'])
(Both F.col("keywords_exp.name") and dataframe.keywords_exp['name'] are of type Column.)
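For completeness, Column.getField should give you the same Column as well:
df = dataframe.withColumn("keywords_name", dataframe["keywords_exp"].getField("name"))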
This is a dataframe having the same schema as yours. You can see that withColumn works well:
from pyspark.sql import functions as F
dataframe = spark.createDataFrame(
    [(("N", "glocations", 1, "Cuba"),)],
    'keywords_exp struct<major:string,name:string,rank:bigint,value:string>')
dataframe.printSchema()
# root
# |-- keywords_exp: struct (nullable = true)
# | |-- major: string (nullable = true)
# | |-- name: string (nullable = true)
# | |-- rank: long (nullable = true)
# | |-- value: string (nullable = true)
df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df.show()
# +--------------------+-------------+
# | keywords_exp|keywords_name|
# +--------------------+-------------+
# |{N, glocations, 1...| glocations|
# +--------------------+-------------+
Upvotes: 6
Reputation: 563
The Spark API seems to have problems with certain protected words. I came across this link when googling the error message
AttributeError: ‘function’ object has no attribute
https://learn.microsoft.com/en-us/azure/databricks/kb/python/function-object-no-attribute
Though "name" is not on the list, I changed all "name"-occurences in the JSON to "nameabcde" and now I can access it:
Upvotes: -1
Reputation: 541
Try setting the schema before reading. Edit: I think the JSON schema needs to be in a specific format. I know it's not documented very well, but you can extract an example using the .json() method to see the format and then adjust your schema files. See the updated example below:
aa.json
[{"keyword_exp": {"name": "aa", "value": "bb"}}, {"keyword_exp": {"name": "oo", "value": "ee"}}]
test.py
from pyspark.sql.session import SparkSession
import json
if __name__ == '__main__':
    spark = SparkSession.builder.appName("test-app").master("local[1]").getOrCreate()

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField('keyword_exp', StructType([
            StructField('name', StringType(), False),
            StructField('value', StringType(), False),
        ])),
    ])

    json_str = schema.json()
    json_obj = json.loads(json_str)

    # Save output of this as file
    print(json_str)

    # Just to see it pretty
    print(json.dumps(json_obj, indent=4))

    # Save to file
    with open("file_schema.json", "w") as f:
        f.write(json_str)

    # Load
    with open("file_schema.json", "r") as f:
        scheme_obj = json.loads(f.read())

    # Re-load into a StructType and use it when reading
    loaded_schema = StructType.fromJson(scheme_obj)

    df = spark.read.json("./aa.json", schema=loaded_schema)
    df.printSchema()

    df = df.select("keyword_exp.name", "keyword_exp.value")
    df.show()
output:
{"fields":[{"metadata":{},"name":"keyword_exp","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":false,"type":"string"},{"metadata":{},"name":"value","nullable":false,"type":"string"}],"type":"struct"}}],"type":"struct"}
{
    "fields": [
        {
            "metadata": {},
            "name": "keyword_exp",
            "nullable": true,
            "type": {
                "fields": [
                    {
                        "metadata": {},
                        "name": "name",
                        "nullable": false,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "value",
                        "nullable": false,
                        "type": "string"
                    }
                ],
                "type": "struct"
            }
        }
    ],
    "type": "struct"
}
root
|-- keyword_exp: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- value: string (nullable = true)
+----+-----+
|name|value|
+----+-----+
| aa| bb|
| oo| ee|
+----+-----+
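A variant of the same idea, if you'd rather not hand-write the schema: let Spark infer it once from a sample file and dump that (a sketch; file names are placeholders):
# infer the schema from a sample file, then save it in exactly this format
inferred = spark.read.json("./aa.json").schema
with open("file_schema.json", "w") as f:
    f.write(inferred.json())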
Upvotes: 2