YusufUMS
YusufUMS

Reputation: 1493

Is there a faster way to check the similar data value in a column from csv file using python?

I have 2 csv files (suppose CSV-1.csv and CSV-2.csv) with more than ten thousand data in each. These 2 files have 2 same columns, namely 'filename' and 'number_of_changes'. My purpose is to compare these 2 files based on the filename. If the filename in CSV-1 is also in the CSV-2, then check the column "number_of_changes". If the number of changes with the same filename between these 2 csv files is different, then store this data in a new csv file with 3 columns (suppose New_CSV.csv). But if the filename and the number_of_changes in these 2 files is same, then pass it (do not store in new csv). Let's suppose my CSV files:

CSV-1.csv

filename     |  number_of_changes
---------------------------------
A            |   20
B            |   10
C            |   10
E            |    5
F            |   15

CSV-2.csv

filename     |  number_of_changes
A            |   20
B            |   15
D            |   30
E            |   10

I need to create the new csv file (New_CSV.csv) from the 2 files above:

filename     |  number_of_changes-1 | number_of_changes-2
----------------------------------------------------------
B            |   10                 |  15
C            |   10                 |   0
D            |    0                 |  30
E            |    5                 |  10
F            |   15                 |   0

I've written the code so far:

import pandas as pd

cols = ['filename','number_of_changes']
data1 = pd.read_csv('CSV-1.csv')
data2 = pd.read_csv('CSV-2.csv')
df1 = data1[cols]
df2 = data2[cols]

lshist = []
for x in range(0,len(df1)-1):
    lshist.append(list(df1.iloc[x]))

lsmyers = []
for y in range(0,len(df2)-1):
    lsmyers.append(list(df2.iloc[y]))

with open('New_CSV.csv', 'w') as csvfile:
    header = ['filename', 'number_of_changes-1', 'number_of_changes-2']
    writers = csv.writer(csvfile, delimiter=',')
    writers.writerow(header)
    for fn in range(1,len(lshist)-1):
        tmp = []
        fnhist = lshist[fn][0]
        for x in range(0,len(lsmyers)-1):
            sys.stdout.write('\rSearching in myers : %i' % (x+1) + ' out of %i' % (len(lsmyers)-1) + ' in % i' % (fn) + ' out of %i' % (len(lshist)-1) + ' in histogram')
            sys.stdout.flush()
            if fnhist != lsmyers[x][0]:
                pass
            else:
                if lshist[fn][1] == lsmyers[x][1]:
                    pass
                else:
                    tmp = [fnhist,lshist[fn][1],lsmyers[x][1]]

        writers.writerow(tmp)

When I run the code, it works only for the files which are in both CSV files. But it doesn't work for the file which is not in the other CSV file. Also, one disadvantage is that it takes very long time when the data is very large. So, is there faster and better algorithm? Please your suggestion. Thank you.

Upvotes: 2

Views: 149

Answers (2)

Sociopath
Sociopath

Reputation: 13401

You need something like this

df1 = pd.read_csv('CSV-1.csv')
df2 = pd.read_csv('CSV-2.csv')

Merge both dataframes

df3 = df1.merge(df2, on="filename", how="outer",suffixes=('-1', '-2'))

Remove the rows with same number of changes

df3 = df3[df3['number_of_changes-1'] != df3['number_of_changes-2']]

Fill NAs with 0 and sort by filename

df3.fillna(0, inplace=True)
df3 = df3.sort_values(by ='filename').reset_index(drop=True)

output:

    filename    number_of_changes-1     number_of_changes-2
0       B       10.0                    15.0
1       C       10.0                     0.0
2       D        0.0                    30.0
3       E        5.0                    10.0
4       F       15.0                     0.0

Upvotes: 1

Alexander Ejbekov
Alexander Ejbekov

Reputation: 5940

There is indeed an easier way - merge the dataframes in one. Basically something like this:

df1 = pd.read_csv("CSV-1.csv")
df2 = pd.read_csv("CSV-2.csv")
df3 = df1.merge(df2, on="filename", how="left")
df3.fillna(0, inplace=True)
# df3 would contain the following dataframe
###########
# filename  number_of_changes_x number_of_changes_y
# 0 A   20  20.0
# 1 B   10  15.0
# 2 C   10  0.0
# 3 E   5   10.0
# 4 F   15  0.0

Basically merge the two dataframes in one and filling the np.NaN values with 0. Depending on your needs you may need to cast the second column to int. Also if you need to change the column names, use the rename method.

Upvotes: 2

Related Questions