Reputation: 1493
I have a dataframe containing millions of rows. Suppose this is the dataframe, named mydataframe:
filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
A        |     4     |     4     |     3     |     3
B        |     3     |     5     |     2     |     2
C        |     5     |     5     |     6     |     7
D        |     2     |     2     |     3     |     3
E        |     4     |     5     |     5     |     3
---------------------------------------------------------
I need to separate the rows based on differing insert or delete counts and save them into a new CSV file named different.csv, and save the rest of the data, which has the same numbers of inserts and deletes, into a separate CSV file called same.csv. In other words, if a row has a different number between #insert-1 and #insert-2, or between #delete-1 and #delete-2, save it in different.csv; otherwise, save it in same.csv.
The expected result:
different.csv:
filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
B        |     3     |     5     |     2     |     2
C        |     5     |     5     |     6     |     7
E        |     4     |     5     |     5     |     3
---------------------------------------------------------
same.csv:
filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
A        |     4     |     4     |     3     |     3
D        |     2     |     2     |     3     |     3
---------------------------------------------------------
This is my code so far:
import csv

fields = list(mydataframe.columns)  # header row for both output files

df_different = []
df_same = []
for row in range(len(mydataframe)):
    ins_1 = mydataframe.iloc[row][1]
    ins_2 = mydataframe.iloc[row][2]
    del_1 = mydataframe.iloc[row][3]
    del_2 = mydataframe.iloc[row][4]
    if (ins_1 != ins_2) or (del_1 != del_2):
        df_different.append(mydataframe.iloc[row])
    else:
        df_same.append(mydataframe.iloc[row])

with open('different.csv', 'w') as diffcsv:
    writer = csv.writer(diffcsv, delimiter=',')
    writer.writerow(fields)
    for item in df_different:
        writer.writerow(item)

with open('same.csv', 'w') as samecsv:
    writer = csv.writer(samecsv, delimiter=',')
    writer.writerow(fields)
    for item in df_same:
        writer.writerow(item)
Actually, the code works well, but when the dataset is very large (I have millions of rows) it takes a very long time (more than 3 hours) to run. My question is whether there is a way to make it faster. Thank you.
Upvotes: 4
Views: 253
Reputation: 212
Filter the DataFrame directly with boolean indexing:
Same dataframe:
same_dataframe = mydataframe[(mydataframe["#insert-1"] == mydataframe["#insert-2"]) & (mydataframe["#delete-1"] == mydataframe["#delete-2"])]
Different dataframe:
different_data = mydataframe[(mydataframe["#insert-1"] != mydataframe["#insert-2"]) | (mydataframe["#delete-1"] != mydataframe["#delete-2"])]
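Each result can then be written straight to CSV with to_csv (index=False just drops pandas' row index from the output):

same_dataframe.to_csv("same.csv", index=False)
different_data.to_csv("different.csv", index=False)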
This should be much faster than iterating over the rows. Hope it helps.
Upvotes: 2
Reputation: 353059
Avoid iterating over rows; that's pretty slow. Instead, vectorize the comparison operation:
same_mask = (df["#insert-1"] == df["#insert-2"]) & (df["#delete-1"] == df["#delete-2"])
df.loc[same_mask].to_csv("same.csv", index=False)
df.loc[~same_mask].to_csv("different.csv", index=False)
For a dataframe of 1M rows, this takes me only a few seconds.
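If you want to sanity-check that claim on your own machine, here is a rough timing sketch on synthetic data (the 1M-row frame below is made up purely for benchmarking):

import time
import numpy as np
import pandas as pd

# Build a hypothetical 1M-row frame with the question's column layout
n = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "filename": np.arange(n).astype(str),
    "#insert-1": rng.integers(0, 10, n),
    "#insert-2": rng.integers(0, 10, n),
    "#delete-1": rng.integers(0, 10, n),
    "#delete-2": rng.integers(0, 10, n),
})

start = time.perf_counter()
same_mask = (df["#insert-1"] == df["#insert-2"]) & (df["#delete-1"] == df["#delete-2"])
df.loc[same_mask].to_csv("same.csv", index=False)
df.loc[~same_mask].to_csv("different.csv", index=False)
print(f"split + write took {time.perf_counter() - start:.2f}s")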
Upvotes: 6
Reputation: 16660
One of the easy things you can do is to pass a sufficiently large buffer to the open function; buffering=64*1024*1024 gives a 64 MB write buffer.
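For example, applied to one of the question's write blocks (a sketch; the exact buffer size is just a starting point to tune):

# 64 MB buffer cuts down the number of underlying write syscalls
with open('different.csv', 'w', buffering=64*1024*1024) as diffcsv:
    writer = csv.writer(diffcsv, delimiter=',')
    writer.writerow(fields)
    for item in df_different:
        writer.writerow(item)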
Another thing is the iteration over the dataframe: instead of iterating over row numbers, you can iterate directly over the rows, like:
for index, row in mydataframe.iterrows():
    # row is a Series indexed by column name
    ins_1 = row["#insert-1"]
    ins_2 = row["#insert-2"]
    del_1 = row["#delete-1"]
    del_2 = row["#delete-2"]
I would expect it to be much faster.
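If it is still too slow after that, itertuples is generally faster again than iterrows, since it yields plain namedtuples instead of Series; a sketch, assuming the question's column order:

for row in mydataframe.itertuples(index=False):
    # with index=False the fields follow the column order:
    # filename, #insert-1, #insert-2, #delete-1, #delete-2
    ins_1, ins_2, del_1, del_2 = row[1], row[2], row[3], row[4]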
Upvotes: 3