Mdev

Reputation: 463

How to use Dask to process a file (or files) in multiple stages

I'm processing a large text file in memory in three stages (currently not using pandas/dataframes).

The script takes one raw data text file and processes it in three stages.

How should I set up a Dask script to run this locally? Beyond that, how can I set it up to work with multiple raw files (i.e. raw1.txt, raw2.txt, raw3.txt)?

At the moment each stage method does not return anything; instead it writes its output to a specific file location that the next method knows about.

def stage_1():
    inputFile = r"C:\Data\RawData\rawData.txt"
    outputFile = r"C:\Data\Processed\stage_1.txt"

    with open(inputFile, "r") as f2, open(outputFile, "w") as f1:
        # Process input file f2
        # Write results to f1
        pass

if __name__ == "__main__":
    stage_1()
    stage_2()
    stage_3()

Upvotes: 0

Views: 415

Answers (1)

MRocklin

Reputation: 57281

I suspect you'll run into a few issues.

Function Purity

Dask generally assumes that functions are pure rather than relying on side effects. If you want to use Dask then I recommend changing your functions so that they return data rather than produce files.
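For example, here is a minimal sketch of pure stages chained with dask.delayed (the transformations are placeholders, not your actual processing):

import dask

def stage_1(text):
    # ... real processing of the raw text goes here ...
    return text.upper()   # placeholder transformation

def stage_2(text):
    return text.strip()   # placeholder transformation

def stage_3(text):
    return len(text)      # placeholder transformation

if __name__ == "__main__":
    with open(r"C:\Data\RawData\rawData.txt") as f:
        raw = f.read()

    # Build a lazy task graph; nothing executes until .compute()
    a = dask.delayed(stage_1)(raw)
    b = dask.delayed(stage_2)(a)
    c = dask.delayed(stage_3)(b)

    result = c.compute()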

As a hacky workaround, you could pass filenames between functions.
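Something like this sketch, where each stage writes its file and returns the path so the stage-to-stage dependency is still visible to Dask (stage_2 and stage_3 would follow the same pattern):

import dask

def stage_1(input_path):
    output_path = r"C:\Data\Processed\stage_1.txt"
    with open(input_path) as f_in, open(output_path, "w") as f_out:
        f_out.write(f_in.read())   # real processing goes here
    return output_path             # the returned path becomes stage_2's input

if __name__ == "__main__":
    p1 = dask.delayed(stage_1)(r"C:\Data\RawData\rawData.txt")
    # p2 = dask.delayed(stage_2)(p1), and so on
    p1.compute()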

No Parallelism

The workflow you've described has no intrinsic parallelism. You can have Dask run your functions, but it will just run them one after the other. You would need to think about how to break up your computation a bit so that there are several function calls that could run in parallel. Dask will not do this thinking for you.
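For instance, one natural way to get parallelism here is to process several raw files at once, each through its own independent pipeline. A sketch, reusing the pure stage functions from above and the raw1/raw2/raw3 naming from the question:

import dask

def read_file(path):
    with open(path) as f:
        return f.read()

raw_files = [r"C:\Data\RawData\raw1.txt",
             r"C:\Data\RawData\raw2.txt",
             r"C:\Data\RawData\raw3.txt"]

results = []
for path in raw_files:
    raw = dask.delayed(read_file)(path)
    a = dask.delayed(stage_1)(raw)   # pure stages as sketched above
    b = dask.delayed(stage_2)(a)
    c = dask.delayed(stage_3)(b)
    results.append(c)

# The three pipelines share no dependencies, so Dask can run them in parallel
outputs = dask.compute(*results)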

Upvotes: 1
