Reputation: 760
I have a pandas dataframe of about 2 million rows and 80 columns.
I would like to write the dataframe to a CSV file as well as a Parquet file.
Assume the dataframe is available in the df variable.
Initial approach:
import time

print('Creating csv and parquet files')
st = time.time()
df.to_csv('output_file.csv')
df.to_parquet('output_file.parquet')
print(f'Created csv and parquet files in {time.time() - st} seconds')
Writing the files with this approach takes too long. Since these are two independent operations, I assumed I could take advantage of multiple processes.
Newer approach:
from concurrent.futures import ProcessPoolExecutor

def build_csv(dataframe, output_filename):
    print(f'Building csv: {output_filename}')
    dataframe.to_csv(output_filename)

def build_parquet(dataframe, output_filename):
    print(f'Building parquet: {output_filename}')
    dataframe.to_parquet(output_filename)

with ProcessPoolExecutor(max_workers=3) as executor:
    executor.submit(build_csv, (df, 'output_file.csv'))
    executor.submit(build_parquet, (df, 'output_file.parquet'))
The newer approach runs without errors, but I do not see any files being created, and I am not sure why this happens.
Is there a better (quicker) approach for writing a pandas dataframe to different files?
Upvotes: 1
Views: 1434
Reputation: 3527
EDIT: I kept the threading solution below for your reference. However, the multiprocessing solution here should avoid the Python GIL problem. I've tested it and can see that the files are written successfully:
from multiprocessing import Pool
import pandas as pd

# original data:
data = pd.DataFrame([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
])

def SaveDataToCsv(data):
    print('Started export to .csv')
    data.to_csv('data.csv')
    print('Finished export to .csv')

def SaveDataToParquet(data):
    print('Started export to .parquet')
    data.to_parquet('data.parquet')
    print('Finished export to .parquet')

# multiprocessing method:
pool = Pool(processes=2)
process1 = pool.apply_async(SaveDataToCsv, [data])
process2 = pool.apply_async(SaveDataToParquet, [data])
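Note that apply_async returns immediately. If this runs as a standalone script rather than in an interactive session, the main process can exit before the pool workers finish, so it may be worth blocking until both exports are done. A minimal sketch, continuing from the pool, process1 and process2 objects above:

# block until both exports have finished before exiting;
# get() also re-raises any exception that occurred in the worker
process1.get()
process2.get()
pool.close()
pool.join()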
Tested the threading library and it seems to work fine:
import pandas as pd
import threading

# original data:
data = pd.DataFrame([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
])

def SaveDataToCsv(data):
    data.to_csv('data.csv')

def SaveDataToParquet(data):
    data.to_parquet('data.parquet')

thread1 = threading.Thread(target=SaveDataToCsv, args=(data,))
thread2 = threading.Thread(target=SaveDataToParquet, args=(data,))

thread1.start()
thread2.start()
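If you need to know when both files are on disk (for example to time the writes or read the files back immediately), the main thread can join the workers. A small sketch following on from thread1 and thread2 above:

# block until both export threads have completed
thread1.join()
thread2.join()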
Upvotes: 3
Reputation: 1492
Since you are writing to the same disk, the disk is the bottleneck, and multiprocessing will not speed up the operation.
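Whether the bottleneck is really disk I/O or the serialization work itself can be checked by timing each write on its own and comparing against the concurrent run; a minimal sketch, assuming the df dataframe from the question:

import time

# time each write separately; if running them in separate processes still takes
# roughly the sum of these two times, the shared disk (I/O) is likely the limit
st = time.time()
df.to_csv('output_file.csv')
print(f'csv took {time.time() - st:.1f} s')

st = time.time()
df.to_parquet('output_file.parquet')
print(f'parquet took {time.time() - st:.1f} s')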
Upvotes: 0