Reputation: 1941
All -
I have millions of individual JSON files, and I want to ingest them all into a single Spark DataFrame. However, I didn't find an append
call that would let me add each JSON file incrementally. Instead, the only way I could make it work is:
# pseudo-code: loop over every JSON file and union it into one DataFrame
for json_file in all_json_files:
    df_tmp = spark.read.json(json_file, schema=my_schema)
    df = df.union(df_tmp)
Here df is the final aggregated DataFrame. This approach works for a few hundred files, but as the count approaches thousands it gets slower and slower. I suspect the cost of creating and merging DataFrames is significant, and it feels awkward as well. Is there a better approach? TIA
Upvotes: 0
Views: 140
Reputation: 758
You can just pass the path to the folder instead of individual files, and Spark will read all the files in it.
For example, if your files are in a folder called JsonFiles, you can write:
df = spark.read.json("/path/to/JsonFiles/")
df.show()
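With millions of files, you may also want to pass your schema explicitly so Spark doesn't have to infer it by sampling the data. A minimal sketch, assuming my_schema from the question and that the files use a .json extension:

# Sketch, assuming my_schema is the schema defined in the question.
# Passing the schema skips the costly inference pass over millions of files,
# and the glob pattern limits the read to .json files in the folder.
df = spark.read.json("/path/to/JsonFiles/*.json", schema=my_schema)
df.show()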
Upvotes: 1