Error_2646

Databricks parallelize file system operations

This question applies broadly but the specific use case is concatenating all fragmented files from a number of directories.

The crux of the question is how to optimize and inspect the parallelism of file system operations on Databricks.

The directory layout is:

    fragmented_files/
        entity_1/
            fragment_1.csv
            ...
            fragment_n.csv
        ...
        entity_n/
            fragment_1.csv
            ...
            fragment_n.csv

    merged_files/
        entity_1/
            merged.csv
        ...
        entity_n/
            merged.csv

I have working code, the gist of which is:

import concurrent.futures

def concat_files(fragged_dir, new):
    # Open the merged output for writing and append each fragment's contents.
    with open(new, "w") as nout:
        for orig_frag_file in fragged_dir:
            with open(orig_frag_file) as o_file:
                nout.write(o_file.read())

# executor.map needs one iterable per argument of concat_files;
# all_merged_file_paths is a placeholder for the output paths.
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(concat_files,
                           all_directories_with_fragmented_files,
                           all_merged_file_paths)
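Not Databricks-specific, but one way to inspect how much parallelism a `ThreadPoolExecutor` actually achieves is to record which worker thread handles each task. A minimal sketch, where `work` and the `sleep` are stand-ins for the real file I/O:

```python
import concurrent.futures
import threading
import time

def work(i):
    # Simulate an I/O-bound task and report the thread that ran it.
    time.sleep(0.1)
    return threading.current_thread().name

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    thread_names = list(executor.map(work, range(8)))

# Distinct worker threads actually used (at most max_workers).
print(sorted(set(thread_names)))
```

If the distinct-thread count stays at 1, or wall-clock time scales linearly with the task count, the work is effectively running serially.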

Questions:
