ts10

Reputation: 87

Is there a better way to improve the concat speed?

I'm a student in Korea and I'm using Python to analyze options data (finance). I'm looking for a better way to speed up the performance of my Python code.

The target data is the per-minute transaction records of the options, covering the period from 2015 to 2019. Because the data is split across 1227 text files (one per workday over the 5 years), I tried to concatenate all 1227 files to minimize the number of file accesses. This is because I will use the resulting concatenated (preprocessed) file repeatedly, and reading every separate file each time took too long. Below is part of my code.

# file_name is a list containing the names of all 1227 daily files, ordered by date

result_df = pd.DataFrame()
for f in file_name:
    data_opt = pd.read_csv(location + f, header=None, sep="\t")

    # do something
    # ...
    # ...

    oneday_df = pd.concat([minute_call, minute_put], axis=0)  # result of processing one day of data

    result_df = pd.concat([result_df, oneday_df], axis=0)

result_df.to_csv()

This code works and I get the correct result. However, the speed drops as the loop progresses: the code processes the early files quickly, but it gets slower and slower by the time it reaches the later files. Is there a better way to speed up the performance of my Python code?

(Sorry for my awkward English and thank you for reading all questions)

Upvotes: 2

Views: 1724

Answers (3)

Michael Szczesny

Reputation: 5026

I timed two approaches to combining 1000 CSV files with 100 rows and 10 columns each:

  • [19.50 s] collect dataframes in a list and save the concatenated dataframes
  • [09.84 s] iteratively append to a csv file

First I generated the test files and collected the filenames in a list:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(100,10))

for i in range(1000):
    df.to_csv(f'../data/test/file{i:04d}.csv', index=False)

import glob
files = glob.glob('../data/test/file*.csv')

Then I timed both methods.

append to list, concat all, to_csv

%%time
l = []
for file in files:
    l.append(pd.read_csv(file))
pd.concat(l).to_csv('../data/test/concat_files.csv', index=False)

Out:

CPU times: user 4.7 s, sys: 547 ms, total: 5.25 s
Wall time: 19.5 s

append mode to csv file

%%time
pd.read_csv(files[0])[:0].to_csv('../data/test/append_files.csv', index=False)
for file in files:
    pd.read_csv(file).to_csv('../data/test/append_files.csv', mode='a', header=False, index=False)

Out:

CPU times: user 8.09 s, sys: 1.19 s, total: 9.28 s
Wall time: 9.84 s

Upvotes: 1

Azuuu

Reputation: 894

Rather than concatenating inside the for-loop, I think it's better to store those dataframes in a list and concatenate them once after the loop.

Something like:

minute_something = []
for f in file_name:
    data_opt = pd.read_csv(location + f, header=None, sep="\t")

    # do something
    # ...
    # ...

    minute_something.append(minute_call)
    minute_something.append(minute_put)  # result of processing one day of data

result_df = pd.concat(minute_something, axis=0)

Upvotes: 2

Jiří Baum

Reputation: 6940

Rather than concatenating in memory, keep the output CSV file open and write each part to it separately as you go?

That way you'll never have more than one day's worth of data in memory at a time, improving not only speed but also memory consumption.

Something like:

with open('out_file.csv', 'w') as of:
    for i, f in enumerate(file_name):
        data_opt = pd.read_csv(location + f, header=None, sep="\t")

        # do something
        # ...
        # ...

        oneday_df = pd.concat([minute_call, minute_put], axis=0)  # result of processing one day of data

        is_first_part = (i == 0)
        oneday_df.to_csv(of, header=is_first_part)

Upvotes: 2
