Reputation: 87
I'm a student in Korea and I'm using Python to analyze option data (finance). I'm looking for a better way to speed up my Python code.
The target data is the per-minute transaction records of the options, covering the period from 2015 to 2019. Because the data is split into 1227 text files (one per workday over the 5 years), I tried to concatenate all 1227 files into one, to minimize the number of file reads. I will use the resulting file (the concatenated, preprocessed file) repeatedly, and reading every separate file each time took too long. Below is part of my code.
# file_name is a list containing the names of all 1227 daily files, ordered by date
result_df = pd.DataFrame()
for f in file_name:
    data_opt = pd.read_csv(location + f, header=None, sep="\t")
    # do something
    # ...
    # ...
    oneday_df = pd.concat([minute_call, minute_put], axis=0)  # result of processing one day's data
    result_df = pd.concat([result_df, oneday_df], axis=0)
result_df.to_csv()
This code works and gives the proper result. However, the speed drops as the loop progresses: the code runs quickly on the early files but gets slower and slower on the later ones. Is there a better way to speed up my Python code?
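For reference, the slowdown can be seen by timing each iteration. This is only an illustrative sketch of the same loop with timing added; the timing lines are not part of my actual script:
import time

result_df = pd.DataFrame()
timings = []
for f in file_name:
    start = time.perf_counter()
    data_opt = pd.read_csv(location + f, header=None, sep="\t")
    # do something
    # ...
    # ...
    oneday_df = pd.concat([minute_call, minute_put], axis=0)
    result_df = pd.concat([result_df, oneday_df], axis=0)
    timings.append(time.perf_counter() - start)  # seconds spent on this file
print(f"first file: {timings[0]:.3f}s, last file: {timings[-1]:.3f}s")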
(Sorry for my awkward English, and thank you for reading the whole question.)
Upvotes: 2
Views: 1724
Reputation: 5026
I timed two approaches to combining 1000 CSV files with 100 rows and 10 columns each.
First I generated the test files and collected the filenames in a list:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(100, 10))
for i in range(1000):
    df.to_csv(f'../data/test/file{i:04d}.csv', index=False)

import glob
files = glob.glob('../data/test/file*.csv')
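Note that glob does not guarantee any particular order, so if the files must be combined in date order (as in the question), it is worth sorting the list first:
files = sorted(glob.glob('../data/test/file*.csv'))  # sort so the files are combined in filename (date) order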
Then I timed both methods.
Append to list, concat all, to_csv:
%%time
l = []
for file in files:
    l.append(pd.read_csv(file))
pd.concat(l).to_csv('../data/test/concat_files.csv', index=False)
Out:
CPU times: user 4.7 s, sys: 547 ms, total: 5.25 s
Wall time: 19.5 s
Append mode to CSV file:
%%time
pd.read_csv(files[0])[:0].to_csv('../data/test/append_files.csv', index=False)  # write just the header row
for file in files:
    pd.read_csv(file).to_csv('../data/test/append_files.csv', mode='a', header=False, index=False)
Out:
CPU times: user 8.09 s, sys: 1.19 s, total: 9.28 s
Wall time: 9.84 s
Upvotes: 1
Reputation: 894
Rather than concatenating inside the for-loop, store the dataframes in a list and concatenate them once after the loop. Each pd.concat inside the loop copies everything accumulated so far, which is why the code gets slower and slower over time.
Something like:
minute_something = []
for f in file_name:
    data_opt = pd.read_csv(location + f, header=None, sep="\t")
    # do something
    # ...
    # ...
    minute_something.append(minute_put)
    minute_something.append(minute_call)  # result of processing one day's data
result_df = pd.concat(minute_something, axis=0)
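If the per-day processing is pulled out into a function, the same list-then-single-concat pattern can be written even more compactly. This is only a sketch: process_one_day, its placeholder body, and the output filename are illustrative, not part of the original code.
def process_one_day(path):
    data_opt = pd.read_csv(path, header=None, sep="\t")
    # do something: build minute_call and minute_put from data_opt
    minute_call = data_opt  # placeholder, replace with the real call-side processing
    minute_put = data_opt   # placeholder, replace with the real put-side processing
    return pd.concat([minute_call, minute_put], axis=0)

# a single concat over all days, then write the combined file once
result_df = pd.concat([process_one_day(location + f) for f in file_name], axis=0)
result_df.to_csv('preprocessed.csv', index=False)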
Upvotes: 2
Reputation: 6940
Rather than concatenating in memory, why not keep the output CSV file open and write each day's part to it as you go?
That way you never hold more than one day's worth of data in memory at a time, which improves speed and reduces memory consumption.
Something like:
with open('out_file.csv', 'w') as of:
    for i, f in enumerate(file_name):
        data_opt = pd.read_csv(location + f, header=None, sep="\t")
        # do something
        # ...
        # ...
        oneday_df = pd.concat([minute_call, minute_put], axis=0)  # result of processing one day's data
        is_first_part = (i == 0)
        oneday_df.to_csv(of, header=is_first_part)  # write the header row only for the first day
Upvotes: 2