Tshilidzi Mudau

Reputation: 7869

Pyspark "cannot resolve '`keyName_1`' given input columns: [keyName_1, keyName_2, keyName_3]\n" when reading a Json file

I'm reading from a json file using pyspark as follows:

raw = sc.textFile(path)
dataset_df = sqlContext.read.json(raw)

Then, to select only specific keys from the JSON file (if the key is present), I use:

dataset_df.select('countryName', 'city', 'age')

However, running the line above gives me the following error (notice the backticks around the column name in the error message; I didn't specify those when I listed the keys I want to pick from the JSON file):

"cannot resolve ' ``countryName `' given input columns: [countryName', 'city', "age"]\n"

I get a similar error when I remove countryName from the list of keys to read from the file. I have tested other keys from the JSON file; for some, the code above runs without issues, but for certain columns I get the error shown above.
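For reference, this is how I can inspect which keys Spark actually inferred from the file (just a diagnostic sketch using the same dataset_df as above):

# Show the schema Spark inferred from the JSON lines;
# keys that are absent from every record will not appear here
dataset_df.printSchema()
print(dataset_df.columns)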

Does anyone know what could be the reason behind this?

Thanks in advance.

Upvotes: 0

Views: 453

Answers (1)

Tshilidzi Mudau

Reputation: 7869

At last, I have found the solution:

The problem is caused by the fact that some of the JSON files I'm reading from might not have all the keys I'm looking for. It is in the cases where a file doesn't have a particular key that I get the error I reported. To get around this, I check whether each key was found in the particular JSON file. If it isn't found, I add it as a column filled with None (this could be any value I use to indicate a missing value).

Here is the resulting code:

from pyspark.sql.functions import lit

raw = sc.textFile(path)
dataset_df = sqlContext.read.json(raw)

all_columns_being_used = ["countryName", "city", "age"]

# Add any key that is missing from this file as a null column,
# so that a later select() can always resolve it
for column_name in all_columns_being_used:
    if column_name not in dataset_df.columns:
        dataset_df = dataset_df.withColumn(column_name, lit(None))
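A variation on this (just a sketch, assuming the same three keys; the exact field types are a guess) is to pass an explicit schema to read.json, so that any key missing from a file still comes back as a null column:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema for the keys used above; adjust the types to match the data
schema = StructType([
    StructField("countryName", StringType(), True),
    StructField("city", StringType(), True),
    StructField("age", LongType(), True),
])

# Keys missing from a record are filled with null instead of being dropped
dataset_df = sqlContext.read.json(raw, schema=schema)
dataset_df.select("countryName", "city", "age")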

Upvotes: 1
