Is there a faster way to check the similar data value in a column from csv file using python?

Question

I have 2 csv files (suppose CSV-1.csv and CSV-2.csv) with more than ten thousand data in each. These 2 files have 2 same columns, namely 'filename' and 'number_of_changes'. My purpose is to compare these 2 files based on the filename. If the filename in CSV-1 is also in the CSV-2, then check the column "number_of_changes". If the number of changes with the same filename between these 2 csv files is different, then store this data in a new csv file with 3 columns (suppose New_CSV.csv). But if the filename and the number_of_changes in these 2 files is same, then pass it (do not store in new csv). Let's suppose my CSV files:

CSV-1.csv

filename     |  number_of_changes
---------------------------------
A            |   20
B            |   10
C            |   10
E            |    5
F            |   15

CSV-2.csv

filename     |  number_of_changes
A            |   20
B            |   15
D            |   30
E            |   10

I need to create the new csv file (New_CSV.csv) from the 2 files above:

filename     |  number_of_changes-1 | number_of_changes-2
----------------------------------------------------------
B            |   10                 |  15
C            |   10                 |   0
D            |    0                 |  30
E            |    5                 |  10
F            |   15                 |   0

I've written the code so far:

import pandas as pd

cols = ['filename','number_of_changes']
data1 = pd.read_csv('CSV-1.csv')
data2 = pd.read_csv('CSV-2.csv')
df1 = data1[cols]
df2 = data2[cols]

lshist = []
for x in range(0,len(df1)-1):
    lshist.append(list(df1.iloc[x]))

lsmyers = []
for y in range(0,len(df2)-1):
    lsmyers.append(list(df2.iloc[y]))

with open('New_CSV.csv', 'w') as csvfile:
    header = ['filename', 'number_of_changes-1', 'number_of_changes-2']
    writers = csv.writer(csvfile, delimiter=',')
    writers.writerow(header)
    for fn in range(1,len(lshist)-1):
        tmp = []
        fnhist = lshist[fn][0]
        for x in range(0,len(lsmyers)-1):
            sys.stdout.write('\rSearching in myers : %i' % (x+1) + ' out of %i' % (len(lsmyers)-1) + ' in % i' % (fn) + ' out of %i' % (len(lshist)-1) + ' in histogram')
            sys.stdout.flush()
            if fnhist != lsmyers[x][0]:
                pass
            else:
                if lshist[fn][1] == lsmyers[x][1]:
                    pass
                else:
                    tmp = [fnhist,lshist[fn][1],lsmyers[x][1]]

        writers.writerow(tmp)

When I run the code, it works only for the files which are in both CSV files. But it doesn't work for the file which is not in the other CSV file. Also, one disadvantage is that it takes very long time when the data is very large. So, is there faster and better algorithm? Please your suggestion. Thank you.

Sociopath · Accepted Answer

You need something like this

df1 = pd.read_csv('CSV-1.csv')
df2 = pd.read_csv('CSV-2.csv')

Merge both dataframes

df3 = df1.merge(df2, on="filename", how="outer",suffixes=('-1', '-2'))

Remove the rows with same number of changes

df3 = df3[df3['number_of_changes-1'] != df3['number_of_changes-2']]

Fill NAs with 0 and sort by filename

df3.fillna(0, inplace=True)
df3 = df3.sort_values(by ='filename').reset_index(drop=True)

output:

    filename    number_of_changes-1     number_of_changes-2
0       B       10.0                    15.0
1       C       10.0                     0.0
2       D        0.0                    30.0
3       E        5.0                    10.0
4       F       15.0                     0.0

Is there a faster way to check the similar data value in a column from csv file using python?

Answers (2)

Related Questions