Reputation: 463
I'm processing a large text file in memory in three stages (currently not using pandas/dataframes).
This takes one raw data text file and processes it in three stages:
Stage 1 processes raw.txt and kicks out stage1.txt
Stage 2 processes stage1.txt and kicks out stage2.txt
Stage 3 processes stage2.txt and kicks out results.txt
How should I set up a Dask script to work on this locally?
Beyond this, how can you set it up to work with multiple raw.txt files (i.e. raw1, raw2, raw3)?
At the moment each stage method does not return anything, but writes its output to a specific file location which the next method knows about.
def stage_1():
    outputFile = r"C:\Data\Processed\stage_1.txt"
    inputFile = r"C:\Data\RawData\rawData.txt"
    f1 = open(outputFile, "w+")
    f2 = open(inputFile, "r")
    # Process input file f2
    # Write results to f1
    f2.close()
    f1.close()

if __name__ == "__main__":
    stage_1()
    stage_2()
    stage_3()
Upvotes: 0
Views: 415
Reputation: 57281
I suspect you'll run into a few issues.
Dask generally assumes that functions are pure rather than relying on side effects. If you want to use Dask then I recommend that you change your functions so that they return data rather than producing files.
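For example, a minimal sketch of that style with dask.delayed, assuming each stage's result fits in memory; the stage bodies here are placeholders, not your actual processing:

import dask

@dask.delayed
def stage_1(raw_text):
    # transform the raw text and return the stage-1 result (placeholder)
    return raw_text.upper()

@dask.delayed
def stage_2(stage1_data):
    # transform the stage-1 result and return the stage-2 result (placeholder)
    return stage1_data[::-1]

@dask.delayed
def stage_3(stage2_data):
    # transform the stage-2 result and return the final result (placeholder)
    return stage2_data.strip()

if __name__ == "__main__":
    with open(r"C:\Data\RawData\rawData.txt") as f:
        raw = f.read()
    result = stage_3(stage_2(stage_1(raw)))   # builds the task graph
    final = result.compute()                  # runs it locally
    with open(r"C:\Data\Processed\results.txt", "w") as f:
        f.write(final)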
As a hacky workaround you could pass filenames between functions.
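Something like this, where each stage still writes its file but returns the path it wrote, so the next stage depends on it in the task graph (paths and bodies are illustrative):

import dask

@dask.delayed
def stage_1(input_path, output_path):
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.write(src.read())   # placeholder for the real stage-1 processing
    return output_path          # returning the path makes stage_2 depend on this task

@dask.delayed
def stage_2(input_path, output_path):
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.write(src.read())   # placeholder for the real stage-2 processing
    return output_path

if __name__ == "__main__":
    s1 = stage_1(r"C:\Data\RawData\rawData.txt", r"C:\Data\Processed\stage_1.txt")
    s2 = stage_2(s1, r"C:\Data\Processed\stage_2.txt")
    s2.compute()                # stage_3 would chain on in the same way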
The workflow you've described has no intrinsic parallelism. You can have Dask run your functions, but it will just run them one after the other. You would need to think about how to break open your computation a bit so that there are several function calls that could run in parallel; Dask will not do this thinking for you.
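For example, if you have several independent raw files (raw1.txt, raw2.txt, ...), you could run the whole three-stage pipeline once per file and let Dask execute the per-file pipelines in parallel. In this sketch pipeline() is a stand-in for your three stages chained together:

import dask

@dask.delayed
def pipeline(raw_path):
    # stand-in for stage_1 -> stage_2 -> stage_3 on a single raw file
    with open(raw_path) as f:
        data = f.read()
    return len(data)            # placeholder result

if __name__ == "__main__":
    raw_files = [r"C:\Data\RawData\raw1.txt",
                 r"C:\Data\RawData\raw2.txt",
                 r"C:\Data\RawData\raw3.txt"]
    tasks = [pipeline(p) for p in raw_files]
    results = dask.compute(*tasks)   # independent files can be processed in parallel
    print(results)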
Upvotes: 1