Reputation: 1493
I have 2 csv files (suppose CSV-1.csv and CSV-2.csv) with more than ten thousand data in each. These 2 files have 2 same columns, namely 'filename' and 'number_of_changes'. My purpose is to compare these 2 files based on the filename. If the filename in CSV-1 is also in the CSV-2, then check the column "number_of_changes". If the number of changes with the same filename between these 2 csv files is different, then store this data in a new csv file with 3 columns (suppose New_CSV.csv). But if the filename and the number_of_changes in these 2 files is same, then pass it (do not store in new csv). Let's suppose my CSV files:
CSV-1.csv
filename | number_of_changes
---------------------------------
A | 20
B | 10
C | 10
E | 5
F | 15
CSV-2.csv
filename | number_of_changes
A | 20
B | 15
D | 30
E | 10
I need to create the new csv file (New_CSV.csv) from the 2 files above:
filename | number_of_changes-1 | number_of_changes-2
----------------------------------------------------------
B | 10 | 15
C | 10 | 0
D | 0 | 30
E | 5 | 10
F | 15 | 0
I've written the code so far:
import pandas as pd
cols = ['filename','number_of_changes']
data1 = pd.read_csv('CSV-1.csv')
data2 = pd.read_csv('CSV-2.csv')
df1 = data1[cols]
df2 = data2[cols]
lshist = []
for x in range(0,len(df1)-1):
lshist.append(list(df1.iloc[x]))
lsmyers = []
for y in range(0,len(df2)-1):
lsmyers.append(list(df2.iloc[y]))
with open('New_CSV.csv', 'w') as csvfile:
header = ['filename', 'number_of_changes-1', 'number_of_changes-2']
writers = csv.writer(csvfile, delimiter=',')
writers.writerow(header)
for fn in range(1,len(lshist)-1):
tmp = []
fnhist = lshist[fn][0]
for x in range(0,len(lsmyers)-1):
sys.stdout.write('\rSearching in myers : %i' % (x+1) + ' out of %i' % (len(lsmyers)-1) + ' in % i' % (fn) + ' out of %i' % (len(lshist)-1) + ' in histogram')
sys.stdout.flush()
if fnhist != lsmyers[x][0]:
pass
else:
if lshist[fn][1] == lsmyers[x][1]:
pass
else:
tmp = [fnhist,lshist[fn][1],lsmyers[x][1]]
writers.writerow(tmp)
When I run the code, it works only for the files which are in both CSV files. But it doesn't work for the file which is not in the other CSV file. Also, one disadvantage is that it takes very long time when the data is very large. So, is there faster and better algorithm? Please your suggestion. Thank you.
Upvotes: 2
Views: 149
Reputation: 13401
You need something like this
df1 = pd.read_csv('CSV-1.csv')
df2 = pd.read_csv('CSV-2.csv')
Merge both dataframes
df3 = df1.merge(df2, on="filename", how="outer",suffixes=('-1', '-2'))
Remove the rows with same number of changes
df3 = df3[df3['number_of_changes-1'] != df3['number_of_changes-2']]
Fill NAs with 0 and sort by filename
df3.fillna(0, inplace=True)
df3 = df3.sort_values(by ='filename').reset_index(drop=True)
output:
filename number_of_changes-1 number_of_changes-2
0 B 10.0 15.0
1 C 10.0 0.0
2 D 0.0 30.0
3 E 5.0 10.0
4 F 15.0 0.0
Upvotes: 1
Reputation: 5940
There is indeed an easier way - merge the dataframes in one. Basically something like this:
df1 = pd.read_csv("CSV-1.csv")
df2 = pd.read_csv("CSV-2.csv")
df3 = df1.merge(df2, on="filename", how="left")
df3.fillna(0, inplace=True)
# df3 would contain the following dataframe
###########
# filename number_of_changes_x number_of_changes_y
# 0 A 20 20.0
# 1 B 10 15.0
# 2 C 10 0.0
# 3 E 5 10.0
# 4 F 15 0.0
Basically merge the two dataframes in one and filling the np.NaN values with 0. Depending on your needs you may need to cast the second column to int. Also if you need to change the column names, use the rename method.
Upvotes: 2