Reputation: 131
I need some help with getting a series of JSON files from an S3 bucket into a PySpark DataFrame.
The files in this bucket all have a .json extension, but unfortunately they do not adhere to the usual Spark requirement of one JSON object per line; instead, each file consists of a single line holding an array of objects inside square brackets.
So instead of:
{"first_column": 12, "second_column": {"nested_column": "value"}}
{"first_column": 24, "second_column": {"nested_column": "value2"}}
I have:
[{"first_column": 12, "second_column": {"nested_column": "value"}},{"first_column": 24, "second_column": {"nested_column": "value2"}}]
We actually receive the files in this format, and there are so many of them that making manual adjustments is unfortunately not feasible.
The approaches that I have tried so far are as follows:
I've tried to use the spark.read.json method, using the following syntax with the wildcard * to load in multiple files at once (in this case spark is the sqlContext):
df = spark.read.json("s3://path_to_bucket_*.json")
This runs without raising any errors or warnings, and returns the desired schema:
df.printSchema()
root
|-- first_column: long (nullable = true)
|-- second_column: struct (nullable = true)
| |-- nested_column: string (nullable = true)
However, when I try to view the data I get the following:
+------------+-------------+
|first_column|second_column|
+------------+-------------+
| null| null|
+------------+-------------+
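A quick way to confirm that these nulls come from parse failures rather than genuinely empty data is to re-read in FAILFAST mode, which raises on the first malformed record instead of silently emitting nulls. A minimal sketch, using the same wildcard path as above:

# In the default PERMISSIVE mode, unparseable records become null rows;
# FAILFAST surfaces the underlying parse error instead
df_check = spark.read.option("mode", "FAILFAST").json("s3://path_to_bucket_*.json")
df_check.show()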
I have found a way of actually loading the data in, from a Databricks post here, which uses the Spark context sc to read in a paired RDD as follows:
dfRDD = sc.wholeTextFiles("s3://path_to_bucket_*.json")
This returns a PairedRDD with the filename and the file body. However, what is really confusing me is that when I call the following line using the body information from this RDD, it works fine and there are no null values at all:
df = spark.read.json(dfRDD.map(lambda x: x[1]))
So, I am very confused as to why this is happening. I would have thought that the same information was being fed into the function, since the body of the text in the RDD does not contain any line breaks and instead contains JSON objects within square brackets (like the second example I showed above).
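To make the working case concrete, here is a minimal local repro (the sample string is just the second example from above, so no S3 access is needed): spark.read.json also accepts an RDD of JSON strings, and a string holding a JSON array comes back as one row per object.

# A single-line JSON array, identical in shape to the files described above
sample = ['[{"first_column": 12, "second_column": {"nested_column": "value"}},{"first_column": 24, "second_column": {"nested_column": "value2"}}]']
spark.read.json(sc.parallelize(sample)).show()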
While the RDD approach is a workaround, it is unfortunately lacking: first of all, the RDD method is much slower, and more importantly, I need a column containing the name of the file each record came from. I know that this is possible using the input_file_name function from the pyspark.sql.functions module when loading straight in from the files, but it does not work when using the RDD method. I have managed to write a pure Python function that injects the filename from the first element of each pair into the JSON string, but this is painfully slow.
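For reference, a sketch of how the filename could be kept on the RDD route without any string surgery, assuming Spark 2.2+ (where from_json accepts array schemas) and the schema shown above:

from pyspark.sql import functions as F
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

# Schema of the JSON arrays, matching the example files above
schema = ArrayType(StructType([
    StructField("first_column", LongType()),
    StructField("second_column", StructType([
        StructField("nested_column", StringType()),
    ])),
]))

pairs = sc.wholeTextFiles("s3://path_to_bucket_*.json")
df = (pairs.toDF(["filename", "body"])                      # (path, body) pairs
           .withColumn("row", F.explode(F.from_json("body", schema)))
           .select("filename", "row.*"))                    # one row per object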
If anyone can help me out with this I would be very grateful. I appreciate that I might have to use the RDD method, but I am confused as to why spark.read.json works perfectly in the one situation and not the other.
Upvotes: 3
Views: 1698
Reputation: 2545
Although I am not sure what caused one solution to work and the other not, I was able to solve the problem to some extent using only spark.read.json. Setting the parameters allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, and allowBackslashEscapingAnyCharacter to True in read.json removed the null values, and roughly 90% of the data was converted successfully into a DataFrame.
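A minimal sketch of that call, setting the options through DataFrameReader.option (string values, which Spark coerces to booleans) on the same wildcard path used in the question:

# Permissive JSON parser options, as described above
df = (spark.read
           .option("allowComments", "true")
           .option("allowUnquotedFieldNames", "true")
           .option("allowSingleQuotes", "true")
           .option("allowNumericLeadingZero", "true")
           .option("allowBackslashEscapingAnyCharacter", "true")
           .json("s3://path_to_bucket_*.json"))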
Check out the other parameters here
Upvotes: 1