Reputation: 479
I have a directory called data. Within this directory there are four subdirectories: 01, 02, 03 and 04. Within these directories are hundreds of JSON files that I want to load into a Spark DataFrame, one per subdirectory. What is the best way to do this?
I've tried this so far:
directories = ['01', '02', '03', '04']
for directory in directories:
    filepath = '/home/jovyan/data/{}/*.json.gz'.format(directory)
    df = spark.read.format('json').option("header", "true").schema(schema).load(filepath)
    # execute rest of the code here
Upvotes: 1
Views: 2076
Reputation: 5070
You can use os.walk() to find all files and directories in your data folder recursively. For example, if you add a new folder 07 in the future, you don't have to change your current code.
import os

path = './data/'

for root, directories, files in os.walk(path):
    for file in files:
        filepath = os.path.join(root, file)
        if filepath.endswith('.json') or filepath.endswith('.json.gz'):
            df = spark.read.format('json').option("header", "true").schema(schema).load(filepath)
            # execute rest of the code here
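If you want one DataFrame per subdirectory rather than per file, note that Spark's load() also accepts a glob pattern, so each folder can be read in a single call. Here is a minimal sketch along those lines, assuming the same spark session and schema as above (the dfs dict is just an illustrative name):

import os

path = '/home/jovyan/data'

# Build one DataFrame per subdirectory by pointing Spark at a glob
# that matches every compressed JSON file in that folder.
dfs = {}
for subdir in sorted(os.listdir(path)):
    full = os.path.join(path, subdir)
    if os.path.isdir(full):
        dfs[subdir] = (spark.read
                            .format('json')
                            .schema(schema)
                            .load(os.path.join(full, '*.json.gz')))

Each value in dfs is then a single DataFrame covering all the JSON files in that subdirectory.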
Upvotes: 3