Mdev

Reputation: 463

How to use Dask to process a file (or files) in multiple stages

I'm processing a large text file in memory in three stages (currently not using pandas/dataframes).

The script takes one raw data text file and processes it in three stages.

How should I set up a Dask script to run this locally? Beyond that, how can I set it up to work with multiple raw files (i.e. raw1.txt, raw2.txt, raw3.txt)?

At the moment each stage method does not return anything; instead it writes its output to a specific file location that the next method knows about.

def stage_1():
    inputFile = r"C:\Data\RawData\rawData.txt"
    outputFile = r"C:\Data\Processed\stage_1.txt"

    with open(inputFile, "r") as f2, open(outputFile, "w") as f1:
        # Process input file f2
        # Write results to f1
        pass

if __name__ == "__main__":
    stage_1()
    stage_2()
    stage_3()

Upvotes: 0

Views: 415

Answers (1)

MRocklin

Reputation: 57281

I suspect you'll run into a few issues.

Function Purity

Dask generally assumes that functions are pure rather than relying on side effects. If you want to use Dask then I recommend changing your functions so that they return data rather than produce files.
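For example, here is a minimal sketch of pure stages chained with dask.delayed (the transformations are placeholders, not your actual processing):

import dask

def stage_1(text):
    # ... real processing of the raw text goes here ...
    return text.upper()   # placeholder transformation

def stage_2(text):
    return text.strip()   # placeholder transformation

def stage_3(text):
    return len(text)      # placeholder transformation

if __name__ == "__main__":
    with open(r"C:\Data\RawData\rawData.txt") as f:
        raw = f.read()

    # Build a lazy task graph; nothing executes until .compute()
    a = dask.delayed(stage_1)(raw)
    b = dask.delayed(stage_2)(a)
    c = dask.delayed(stage_3)(b)

    result = c.compute()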

As a hacky workaround, you could pass filenames between functions.
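Something like this sketch, where each stage writes its file and returns the path so the stage-to-stage dependency is still visible to Dask (stage_2 and stage_3 would follow the same pattern):

import dask

def stage_1(input_path):
    output_path = r"C:\Data\Processed\stage_1.txt"
    with open(input_path) as f_in, open(output_path, "w") as f_out:
        f_out.write(f_in.read())   # real processing goes here
    return output_path             # the returned path becomes stage_2's input

if __name__ == "__main__":
    p1 = dask.delayed(stage_1)(r"C:\Data\RawData\rawData.txt")
    # p2 = dask.delayed(stage_2)(p1), and so on
    p1.compute()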

No Parallelism

The workflow you've described has no intrinsic parallelism. You can have Dask run your functions, but it will just run them one after the other. You would need to think about how to break up your computation a bit so that there are several function calls that could run in parallel. Dask will not do this thinking for you.
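For instance, one natural way to get parallelism here is to process several raw files at once, each through its own independent pipeline. A sketch, reusing the pure stage functions from above and the raw1/raw2/raw3 naming from the question:

import dask

def read_file(path):
    with open(path) as f:
        return f.read()

raw_files = [r"C:\Data\RawData\raw1.txt",
             r"C:\Data\RawData\raw2.txt",
             r"C:\Data\RawData\raw3.txt"]

results = []
for path in raw_files:
    raw = dask.delayed(read_file)(path)
    a = dask.delayed(stage_1)(raw)   # pure stages as sketched above
    b = dask.delayed(stage_2)(a)
    c = dask.delayed(stage_3)(b)
    results.append(c)

# The three pipelines share no dependencies, so Dask can run them in parallel
outputs = dask.compute(*results)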

Upvotes: 1
