Reputation: 1941
All -
I have millions of individual JSON files, and I want to ingest them all into a single Spark DataFrame. However, I didn't find an append
call that would let me add each JSON file incrementally. Instead, the only way I could make it work is:
# pseudo-code: loop over every JSON file and union it into one DataFrame
for json_file in all_json_files:
    df_tmp = spark.read.json(json_file, schema=my_schema)
    df = df.union(df_tmp)
Here df is the final aggregated DataFrame. This approach works for a few hundred files, but as the count approaches thousands it gets slower and slower. I suspect the cost of creating and merging DataFrames is significant, and it feels awkward as well. Is there a better approach? TIA
Upvotes: 0
Views: 140
Reputation: 758
You can just pass the path to the folder instead of individual files, and Spark will read all the files in it.
For example, if your files are in a folder called JsonFiles, you can write:
df = spark.read.json("/path/to/JsonFiles/")
df.show()
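With millions of files, you may also want to pass your schema explicitly so Spark doesn't have to infer it by sampling the data. A minimal sketch, assuming my_schema from the question and that the files use a .json extension:

# Sketch, assuming my_schema is the schema defined in the question.
# Passing the schema skips the costly inference pass over millions of files,
# and the glob pattern limits the read to .json files in the folder.
df = spark.read.json("/path/to/JsonFiles/*.json", schema=my_schema)
df.show()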
Upvotes: 1