Reputation: 131
I need some help with getting a series of JSON files from an S3 bucket into a PySpark DataFrame.
The files in this bucket all have a .json extension, but unfortunately they do not adhere to the usual Spark requirement of one JSON object per line; instead, each file consists of a single line holding an array of objects inside square brackets.
So instead of:
{"first_column": 12, "second_column": {"nested_column": "value"}}
{"first_column": 24, "second_column": {"nested_column": "value2"}}
I have:
[{"first_column": 12, "second_column": {"nested_column": "value"}},{"first_column": 24, "second_column": {"nested_column": "value2"}}]
We actually receive the files in this format, and there are so many of them that making manual adjustments is unfortunately not feasible.
The approaches that I have tried so far are as follows:
I've tried to use the spark.read.json method, using the following syntax with the wildcard * to load in multiple files at once (in this case spark is the sqlContext):
df = spark.read.json("s3://path_to_bucket_*.json")
This runs without raising any errors or warnings, and returns the desired schema:
df.printSchema()
root
|-- first_column: long (nullable = true)
|-- second_column: struct (nullable = true)
| |-- nested_column: string (nullable = true)
However, when I try to view the data I get the following:
+------------+-------------+
|first_column|second_column|
+------------+-------------+
| null| null|
+------------+-------------+
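A quick way to confirm that these nulls come from parse failures rather than genuinely empty data is to re-read in FAILFAST mode, which raises on the first malformed record instead of silently emitting nulls. A minimal sketch, using the same wildcard path as above:

# In the default PERMISSIVE mode, unparseable records become null rows;
# FAILFAST surfaces the underlying parse error instead
df_check = spark.read.option("mode", "FAILFAST").json("s3://path_to_bucket_*.json")
df_check.show()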
I have found a way of actually loading the data in, from a Databricks post here, which uses the Spark context sc to read in a paired RDD as follows:
dfRDD = sc.wholeTextFiles("s3://path_to_bucket_*.json")
This returns a PairedRDD with the filename and the file body. However, what is really confusing me is that when I call the following line using the body information from this RDD, it works fine and there are no null values at all:
df = spark.read.json(dfRDD.map(lambda x: x[1]))
So, I am very confused as to why this is happening. I would have thought that the same information was being fed into the function, since the body of the text in the RDD does not contain any line breaks and instead contains JSON objects within square brackets (like the second example I showed above).
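To make the working case concrete, here is a minimal local repro (the sample string is just the second example from above, so no S3 access is needed): spark.read.json also accepts an RDD of JSON strings, and a string holding a JSON array comes back as one row per object.

# A single-line JSON array, identical in shape to the files described above
sample = ['[{"first_column": 12, "second_column": {"nested_column": "value"}},{"first_column": 24, "second_column": {"nested_column": "value2"}}]']
spark.read.json(sc.parallelize(sample)).show()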
While the RDD approach is a workaround, it is unfortunately lacking: first of all, the RDD method is much slower, and more importantly, I need a column containing the name of the file each record came from. I know that this is possible using the input_file_name function from the pyspark.sql.functions module when loading straight in from the files, but it does not work when using the RDD method. I have managed to write a pure Python function that injects the filename from the first element of each pair into the JSON string, but this is painfully slow.
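For reference, a sketch of how the filename could be kept on the RDD route without any string surgery, assuming Spark 2.2+ (where from_json accepts array schemas) and the schema shown above:

from pyspark.sql import functions as F
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

# Schema of the JSON arrays, matching the example files above
schema = ArrayType(StructType([
    StructField("first_column", LongType()),
    StructField("second_column", StructType([
        StructField("nested_column", StringType()),
    ])),
]))

pairs = sc.wholeTextFiles("s3://path_to_bucket_*.json")
df = (pairs.toDF(["filename", "body"])                      # (path, body) pairs
           .withColumn("row", F.explode(F.from_json("body", schema)))
           .select("filename", "row.*"))                    # one row per object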
If anyone can help me out with this I would be very grateful. I appreciate that I might have to use the RDD method, but I am confused as to why spark.read.json works perfectly in the one situation and not the other.
Upvotes: 3
Views: 1698
Reputation: 2545
Although I am not sure what caused one solution to work and the other not, I was able to solve the problem to some extent using only spark.read.json. Setting the parameters allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, and allowBackslashEscapingAnyCharacter to True in read.json removed the null values, and roughly 90% of the data was converted successfully into a DataFrame.
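A minimal sketch of that call, setting the options through DataFrameReader.option (string values, which Spark coerces to booleans) on the same wildcard path used in the question:

# Permissive JSON parser options, as described above
df = (spark.read
           .option("allowComments", "true")
           .option("allowUnquotedFieldNames", "true")
           .option("allowSingleQuotes", "true")
           .option("allowNumericLeadingZero", "true")
           .option("allowBackslashEscapingAnyCharacter", "true")
           .json("s3://path_to_bucket_*.json"))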
Check out the other parameters here
Upvotes: 1