yashtodi94

Reputation: 760

Efficiently write large pandas data to different files

I have a pandas dataframe of about 2 million rows (80 columns each).

I would like to output the dataframe to csv as well as a parquet file.

Assume dataframe is present in the df variable

Initial approach:

print('Creating csv and parquet files')
st = time.time()
df.to_csv('output_file.csv')
df.to_parquet('output_file.parquet')
print(f'Created csv and parquet files in {time.time() - st} seconds')

Writing the files with this approach takes too long. Since these are two separate operations, I assumed I could take advantage of multiple processes.

Newer approach:

def build_csv(dataframe, output_filename):
    print(f'Building csv: {output_filename}')
    dataframe.to_csv(output_filename)


def build_parquet(dataframe, output_filename):
    print(f'Building parquet: {output_filename}')
    dataframe.to_parquet(output_filename)


with ProcessPoolExecutor(max_workers=3) as executor:
    executor.submit(build_csv, (df, 'output_file.csv'))
    executor.submit(build_parquet, (df, 'output_file.parquet'))

The newer approach runs without errors, but I do not see any files being created, and I am not sure why.

Is there a better (quicker) approach for writing a pandas dataframe to different files?

Upvotes: 1

Views: 1434

Answers (2)

Daniel

Reputation: 3527

EDIT: I've kept the threading solution below for reference. However, the multiprocessing solution here sidesteps the Python GIL entirely. I've tested it and can confirm that the files are written successfully:

from multiprocessing import Pool
import pandas as pd

# original data:
data = pd.DataFrame([
    [ 1, 2, 3, 4,], 
    [ 1, 2, 3, 4,], 
    [ 1, 2, 3, 4,], 
    [ 1, 2, 3, 4,], 
    [ 1, 2, 3, 4,],
])    


def SaveDataToCsv(data):
    print('Started export to .csv')
    data.to_csv('data.csv')
    print('Finished export to .csv')


def SaveDataToParquet(data):
    print('Started export to .parquet')
    data.to_parquet('data.parquet')
    print('Finished export to .parquet')


# multiprocessing method:
pool = Pool(processes=2)
process1 = pool.apply_async(SaveDataToCsv, [data])
process2 = pool.apply_async(SaveDataToParquet, [data])

# block until both workers have finished; .get() also re-raises any
# exception raised inside a worker instead of failing silently
process1.get()
process2.get()

pool.close()
pool.join()

I also tested the threading library, and it seems to work fine:

import pandas as pd
import threading

# original data:
data = pd.DataFrame([
    [ 1, 2, 3, 4,],
    [ 1, 2, 3, 4,],
    [ 1, 2, 3, 4,],
    [ 1, 2, 3, 4,],
    [ 1, 2, 3, 4,],
])


def SaveDataToCsv(data):        
    data.to_csv('data.csv')


def SaveDataToParquet(data):
    data.to_parquet('data.parquet')    

thread1 = threading.Thread(target=SaveDataToCsv, args=(data,))
thread2 = threading.Thread(target=SaveDataToParquet, args=(data,))

thread1.start()
thread2.start()

# wait for both writes to complete before moving on
thread1.join()
thread2.join()

Upvotes: 3

Dewald Abrie

Reputation: 1492

Since both files are written to the same disk, the disk is the bottleneck, and multiprocessing will not speed up the operation.
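
One way to sanity-check this is to time CSV serialization into an in-memory buffer (no disk involved) against serialization to an actual file. Below is a minimal sketch, using a small synthetic dataframe and a hypothetical bench_output.csv path rather than your real data:

import io
import time

import pandas as pd

# synthetic stand-in for the real dataframe (adjust the size as needed)
df = pd.DataFrame({f'col{i}': range(100_000) for i in range(80)})

# 1) serialize to an in-memory buffer: measures formatting (CPU) cost only
st = time.time()
buf = io.StringIO()
df.to_csv(buf)
print(f'CSV to memory: {time.time() - st:.2f} s')

# 2) serialize to an actual file: formatting cost plus disk I/O
st = time.time()
df.to_csv('bench_output.csv')
print(f'CSV to disk:   {time.time() - st:.2f} s')

If the two timings are close, most of the time is going into CSV formatting rather than the disk; if the file write is much slower, the disk really is the limiting factor.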

Upvotes: 0
