Tim Vivian-Griffiths

Reputation: 131

Null values when loading multiple JSON objects from single-line files into PySpark

I need some help with getting a series of JSON files from an S3 bucket into a PySpark DataFrame.

The files in this bucket all have a .json extension but, unfortunately, do not adhere to the usual Spark requirement of one JSON object per line; instead, all of the objects sit on a single line inside square brackets.

So instead of:

{"first_column": 12, "second_column": {"nested_column": "value"}}
{"first_column": 24, "second_column": {"nested_column": "value2"}}

I have:

[{"first_column": 12, "second_column": {"nested_column": "value"}},{"first_column": 24, "second_column": {"nested_column": "value2"}}]

We actually receive the files in this format, and there are so many of them that making manual adjustments is not feasible, unfortunately.

The approaches that I have tried so far are as follows:

I've tried the spark.read.json method, using the wildcard * to load multiple files at once (in this case spark is the sqlContext):

df = spark.read.json("s3://path_to_bucket_*.json") 

This runs without raising any errors or warnings, and returns the desired schema:

df.printSchema()

root
 |-- first_column: long (nullable = true)
 |-- second_column: struct (nullable = true)
 |     |-- nested_column: string (nullable = true)

However, when I try to view the data I get the following:

+------------+-------------+
|first_column|second_column|
+------------+-------------+
|        null|         null|
+------------+-------------+

I have found a way of actually loading in the data, from Databricks here, which uses the Spark context sc to read in a paired RDD as follows:

dfRDD = sc.wholeTextFiles("s3://path_to_bucket_*.json")

This returns a paired RDD with the filename and the file body. However, what is really confusing me is that when I call the following line using the body information from this RDD, it works fine and there are no null values at all:

df = spark.read.json(dfRDD.map(lambda x: x[1]))

So, I am very confused as to why this is happening: I would have thought the same information was being fed into the function, since the body text in the RDD contains no line breaks and instead holds JSON objects within square brackets (like the second example I showed above).

While this is a workaround, it is unfortunately lacking. First of all, the RDD method is much slower and, more importantly, I need a column with the name of the file each record came from. I know this is possible using the input_file_name function from the pyspark.sql.functions module when loading straight from the files, but it does not work when using the RDD method. I have managed to write a pure Python function that inserts the filename from the first element of each paired RDD into the JSON string, but it is painfully slow.
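To be concrete, the slow pure Python version looks roughly like this (a sketch: tag_records and the source_file column are just names I've used for illustration, and it assumes each file body is a single JSON array of objects):

import json

pairs = sc.wholeTextFiles("s3://path_to_bucket_*.json")

def tag_records(pair):
    # Each body is one JSON array; parse it in plain Python and
    # stamp every record with the file it came from.
    filename, body = pair
    for record in json.loads(body):
        record["source_file"] = filename
        yield json.dumps(record)

df = spark.read.json(pairs.flatMap(tag_records))

Because the parsing happens record by record in Python rather than in Spark's JSON reader, this is where the slowness comes from.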

If anyone can help me out with this I would be very grateful. I appreciate that I might have to use the RDD method, but I am confused as to why spark.read.json works perfectly in the one situation and not the other.

Upvotes: 3

Views: 1698

Answers (1)

mayank agrawal

Reputation: 2545

Although I am not sure what caused one solution to work and the other not, I was able to solve the problem to some extent using spark.read.json alone.

Set the parameters allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, and allowBackslashEscapingAnyCharacter to True in read.json. With this, I was able to remove the null values and 90% of the data was converted successfully into a DataFrame.
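A sketch of what that looks like, with the path adapted from the question:

df = spark.read.json(
    "s3://path_to_bucket_*.json",
    allowComments=True,
    allowUnquotedFieldNames=True,
    allowSingleQuotes=True,
    allowNumericLeadingZero=True,
    allowBackslashEscapingAnyCharacter=True)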

Check out the other parameters here.
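In particular, if you are on Spark 2.2 or later, the multiLine option is designed for files where a single JSON document (such as your top-level array) spans the whole file, so something like the following may avoid the nulls entirely (untested on your data):

df = spark.read.json("s3://path_to_bucket_*.json", multiLine=True)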

Upvotes: 1
