Reputation: 479
I have a directory called data. Within this directory there are four subdirectories: 01, 02, 03 and 04. Within these directories are hundreds of JSON files that I want to load into a Spark DataFrame, one per subdirectory. What is the best way to do this?
I've tried this so far:
directories = ['01', '02', '03', '04']
for directory in directories:
    filepath = '/home/jovyan/data/{}/*.json.gz'.format(directory)
    df = spark.read.format('json').option("header", "true").schema(schema).load(filepath)
    # execute rest of the code here
Upvotes: 1
Views: 2076
Reputation: 5070
You can use os.walk() to find all files and directories in your data folder recursively. For example, if you add a new folder 07 in the future, you don't have to change your current code.
import os

path = './data/'

for root, directories, files in os.walk(path):
    for file in files:
        filepath = os.path.join(root, file)
        if filepath.endswith('.json') or filepath.endswith('.json.gz'):
            df = spark.read.format('json').option("header", "true").schema(schema).load(filepath)
            # execute rest of the code here
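If you want one DataFrame per subdirectory rather than per file, note that Spark's load() also accepts a glob pattern, so each folder can be read in a single call. Here is a minimal sketch along those lines, assuming the same spark session and schema as above (the dfs dict is just an illustrative name):

import os

path = '/home/jovyan/data'

# Build one DataFrame per subdirectory by pointing Spark at a glob
# that matches every compressed JSON file in that folder.
dfs = {}
for subdir in sorted(os.listdir(path)):
    full = os.path.join(path, subdir)
    if os.path.isdir(full):
        dfs[subdir] = (spark.read
                            .format('json')
                            .schema(schema)
                            .load(os.path.join(full, '*.json.gz')))

Each value in dfs is then a single DataFrame covering all the JSON files in that subdirectory.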
Upvotes: 3