everestial

Reputation: 7255

How to read multiple files into multiple threads/processes to optimize data analysis?

I am trying to read 3 different files in Python and do something to extract the data out of them. Then I want to merge the data into one big file.

Since each individual file is already big and takes some time for the data processing, I am thinking that reading and processing all three files in parallel (in multiple threads/processes) would speed things up.

Can someone suggest some improvements to this code to do what I want?

import pandas as pd
from functools import reduce

file01_output = ''
file02_output = ''
file03_output = ''

# I want to do all these three "with open(..)" at once.
with open('file01.txt', 'r') as file01:
    for line in file01:
        something01 = do_something(line)  # placeholder for the per-line processing
        file01_output += something01

with open('file02.txt', 'r') as file02:
    for line in file02:
        something02 = do_something(line)  # placeholder for the per-line processing
        file02_output += something02

with open('file03.txt', 'r') as file03:
    for line in file03:
        something03 = do_something(line)  # placeholder for the per-line processing
        file03_output += something03

def merge(a, b, c):
    a = file01_output
    b = file02_output
    c = file03_output

    # compile the list of dataframes you want to merge
    data_frames = [a, b, c]

    df_merged = reduce(lambda left, right: pd.merge(left, right,
                       on=['common_column'], how='outer'), data_frames).fillna('.')
    return df_merged

Upvotes: 0

Views: 1525

Answers (1)

Paul

Reputation: 5935

There are many ways to use multiprocessing for your problem, so I'll just propose one. Since, as you mentioned, the processing done on each file's data is CPU bound, you can run it in a separate process and expect to see some improvement (how much improvement, if any, depends on the problem, the algorithm, the number of cores, etc.). For example, the overall structure could be a pool over which you map a list of all the filenames you need to process, with the computing done inside the mapped function.

It's easier with a concrete example. Let's pretend we have a list of CSVs 'file01.csv', 'file02.csv', 'file03.csv', each of which has a NUMBER column, and we want to compute whether each number is prime (CPU bound). For example, file01.csv:

NUMBER
1
2
3
...

And the other files look similar but with different numbers to avoid duplicating work. The code to compute the primes could then look like this:

import pandas as pd
from multiprocessing import Pool
from sympy import isprime

def compute(filename):
    # IO (probably not faster)
    my_data_df = pd.read_csv(filename)

    # do some computing (CPU)
    my_data_df['IS_PRIME'] = my_data_df.NUMBER.map(isprime)

    return my_data_df

if __name__ == '__main__':
    filenames = ['file01.csv', 'file02.csv', 'file03.csv']

    # construct the pool and map to the workers
    with Pool(2) as pool:
        results = pool.map(compute, filenames)
    print(pd.concat(results))

I've used the sympy package for a convenient isprime function, and I'm sure the structure of your data is quite different, but hopefully that example illustrates a structure you could use too. Performing your CPU-bound computations in a pool (or a list of Processes) and then merging/reducing/concatenating the results is a reasonable approach to the problem.
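If the final step should be the key-based merge from your original snippet rather than a simple concat, a minimal sketch (assuming each per-file DataFrame shares a 'common_column' key, as in your question) could look like this:

from functools import reduce

# 'results' is the list of DataFrames returned by pool.map(compute, filenames);
# 'common_column' is a placeholder key name taken from the question.
df_merged = reduce(
    lambda left, right: pd.merge(left, right, on=['common_column'], how='outer'),
    results,
).fillna('.')
print(df_merged)

The pd.merge on the shared key simply replaces the pd.concat call above; the pool and the compute function stay the same.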

Upvotes: 1
