Reputation: 1
I have a million CSV files, each with 441 rows and 8 columns. I open each file and check whether any column of row 221 has a value greater than 60. If so, I set that column to "-1" in every row.
For example:
Input
row 220: 65,13,15,27,18,51,20,79
row 221: 25,23,45,67,12,11,23,69
row 222: 12,12,14,15,16,17,19,22
Output
row 220: 65,13,15,-1,18,51,20,-1
row 221: 25,23,45,-1,12,11,23,-1
row 222: 12,12,14,-1,16,17,19,-1
Once I have done the above processing, I write the contents to another file. I do this for all the files.
The code:
import os
import csv
from os import listdir

file_list = []
mypath1 = os.path.join(mypath, dut)  # dut refers to the directory name
out_path1 = os.path.join(mypath1, folder1)
if not os.path.exists(out_path1):
    os.mkdir(out_path1)
for i in listdir(mypath1):
    if i.startswith("PD") and i.endswith(".csv"):
        file_list.append(i)
for j in file_list:
    f = open(os.path.join(mypath1, j), 'r')
    f5 = list(csv.reader(f))
    # collect the columns whose value in row 221 (index 220) exceeds 60
    sec = []
    for col in range(0, 8):
        if int(f5[220][col]) > 60:
            sec.append(col)
    # overwrite those columns with -1 in every row
    for r in range(0, 441):
        for value in sec:
            f5[r][value] = -1
    filename = "temp1_" + j
    f2 = open(os.path.join(out_path1, filename), 'w+')
    f1 = csv.writer(f2)
    f1.writerows(f5)
    f2.close()
    f.close()
flag = 1
flag=1
The code works fine, but processing around 300,000 CSV files takes about an hour (opening a file, doing the above operation, and writing to another file takes approximately 0.01 seconds per file).
Is there any way to speed up this process? I have 20 other directories with the same number of files, so the total time taken would be around 20 hours.
Upvotes: 0
Views: 112
Reputation: 2003
Pandas' pandas.read_csv is faster than csv.reader and should suit your application better. The corresponding function for writing is to_csv.
A comparison can be found here: Fastest Python library to read a CSV file. Reproducing partial statistics from that link (test run on Windows 7):
open_with_python_csv: 1.57318865672 seconds
open_with_pandas_read_csv: 0.371965476805 seconds
read_csv returns a pandas DataFrame, which provides the iloc indexer (integer location) for integer-based indexing (there are many other access methods to suit different requirements). A simple example would look like:
import pandas as pd
df = pd.read_csv("foo.csv")
row5 = df.iloc[4]
col3 = df.iloc[:, 2]
A lot can be done with it, but it would be too broad to cover everything in this answer. I have included the basics, which should solve your problem or at least move it towards resolution.
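Applied to the task in the question, a minimal sketch could look like the following (assuming the CSV files have no header row, hence header=None, and reusing mypath1, out_path1 and file_list from the question's code):
import os
import pandas as pd

for j in file_list:
    # header=None: treat all 441 lines as data, not as column names
    df = pd.read_csv(os.path.join(mypath1, j), header=None)
    # boolean mask of the columns whose value in row 221 (index 220) exceeds 60
    mask = df.iloc[220] > 60
    # set those columns to -1 in every row in one vectorised assignment
    df.loc[:, mask] = -1
    # write without index or header so the output matches the input layout
    df.to_csv(os.path.join(out_path1, "temp1_" + j), header=False, index=False)
The boolean-mask assignment replaces the two nested loops of the original code with a single vectorised operation.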
Upvotes: 3