Psychotechnopath

Reputation: 2744

spark.read.json() taking extremely long to load data

What I've Tried

I have JSON data that comes from an API. I saved all of it into a single directory, and now I am trying to load it into a Spark DataFrame so I can do ETL on it. The API returned the data fragmented (per division), and some divisions had little data while others had a lot. The directory I'm trying to load from looks as follows:

* json_file_1 - 109MB
* json_file_2 - 2.2MB
* json_file_3 - 67MB
* json_file_4 - 105MB
* json_file_5 - **2GB**
* json_file_6 - 15MB
* json_file_7 - 265MB
* json_file_8 - 35MB
* json_file_9 - **500KB**
* json_file_10 - 383MB

I'm using Azure Synapse with an Apache Spark pool, and the directory I'm loading from resides in an ADLS Gen2 data lake. I'm using the following code to load all the data files in the directory. For other projects this code works fine and fast.

blob_path_raw = 'abfss://blob_container@my_data_lake.dfs.core.windows.net'
df = spark.read.json(path=f"{blob_path_raw}/path_to_directory_described_above")

My Question

The code above is taking extremely long to run (more than 3 hours at the time of writing), and I suspect it is stuck somewhere, as loading roughly 4 GB of data is something a Spark pool should easily be able to do. I suspect something is going wrong in Spark because of the heterogeneous file sizes, but I am still rather a novice with Spark, as we only just migrated to Azure Synapse. What is going wrong here, and how do I debug it?

Upvotes: 0

Views: 2911

Answers (1)

Psychotechnopath

Reputation: 2744

I found the problem. Nine out of the ten files were in JSON Lines format, so every line was a separate, valid JSON object. Example below:

{"name": "Gilbert", "wins": [["straight", "7♣"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

The big file (which I obtained through a different route than the other nine files) was in regular JSON format, where every JSON object was an element of a top-level array and spanned multiple lines:

[
  {
    "name": "Gilbert",
    "wins": [
      [
        "straight",
        "7♣"
      ]
    ]
  },
  {
    "name": "Alexa",
    "wins": [
      [
        "two pair",
        "4♠"
      ],
      [
        "two pair",
        "9♠"
      ]
    ]
  }
]

As per the Spark documentation (https://spark.apache.org/docs/latest/sql-data-sources-json.html), you need to set the multiLine option to true to read multiline JSON. Without that option, Spark expects every line in the file to contain a separate, self-contained, valid JSON object. Mixing multiline JSON files and JSON Lines files in a single read is therefore not possible.
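For reference, a single multiline JSON file can be read on its own by enabling that option (the file path below is illustrative); it would not have helped in my case, though, because the option applies to the whole read and the other nine files are JSON Lines:

# Sketch: reading one multiline JSON file by itself (illustrative path).
# multiLine tells Spark that a single JSON record may span several lines,
# instead of the default of one self-contained record per line.
df_big = (
    spark.read
         .option("multiLine", True)
         .json(f"{blob_path_raw}/path_to_directory_described_above/json_file_5")
)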

TLDR Solution: Converting the big file from multiline JSON to JSON Lines fixed the problem, and the DataFrame now loads fine. I used the following code to do the conversion:

import json

# Load the multiline JSON file into memory as a list of Python dicts.
with open("multilines_json_file.json", "r") as f:
    python_obj = json.load(f)


def dump_jsonl(data, output_path, append=False):
    """
    Write a list of objects to a JSON Lines file (one object per line).
    """
    mode = 'a+' if append else 'w'
    with open(output_path, mode, encoding='utf-8') as f:
        for line in data:
            json_record = json.dumps(line, ensure_ascii=False)
            f.write(json_record + '\n')
    print('Wrote {} records to {}'.format(len(data), output_path))


# The file was fully loaded into python_obj above, so it is safe to overwrite
# it here. Appending instead would leave the original multiline JSON at the
# top of the file and produce an invalid mix of formats.
dump_jsonl(python_obj, 'multilines_json_file.json')
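Note that this approach loads the entire file into memory before rewriting it, so it only works when the file fits comfortably in RAM; for files much larger than that, a streaming JSON parser (e.g. ijson) would be needed.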

Upvotes: 2
