jan

Reputation: 211

Compare two columns from a CSV file using Python

I have a CSV file like:

item1,item2 
A,B
B,C
C,D
E,F

I want to compare these two columns and find the values that appear in both item1 and item2. The output should look like this:

 item 
  B
  C

I have tried this code:

import csv

with open('output/id.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)

    # compare the two cells within each row
    for line in csvreader:
        if line[0] == line[1]:
            print(line)
        else:
            print("not match")

I am new to programming. I don't know what the logic should be or how to solve this problem. Please help.

Upvotes: 0

Views: 12297

Answers (3)

Jean-François Fabre

Reputation: 140168

You cannot solve this by comparing the cells row by row. You have to work on the columns.

Read both columns of your CSV file (without the header row) into two Python sets.

Compute the sorted intersection and write it to another CSV file:

import csv

with open("test.csv") as f:
    cr = csv.reader(f)
    next(cr) # skip title
    col1 = set()
    col2 = set()
    for a,b in cr:
        col1.add(a)
        col2.add(b)

with open("output.csv","w",newline="") as f:
    cw = csv.writer(f)
    cw.writerow(["item"])
    # wrap each value in a list so multi-character items stay in one cell
    cw.writerows([item] for item in sorted(col1 & col2))

with test.csv as:

item1,item2
A,B
B,C
C,D
E,F

you get

item
B
C

Note: if your CSV file has more than two columns, the unpacking in "for a,b in cr" raises a ValueError. Index the row instead:

for row in cr:
    col1.add(row[0])
    col2.add(row[1])
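
As a rough sketch of why the row-wise comparison finds nothing while the column-wise one does, the snippet below runs both on the question's sample data. The helper name common_items and the use of io.StringIO in place of a real file are assumptions made for illustration only; they are not part of the code above.

```python
import csv
import io

def common_items(csv_text):
    """Return the sorted values found in both of the first two columns."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    col1, col2 = set(), set()
    for row in reader:
        col1.add(row[0])
        col2.add(row[1])
    return sorted(col1 & col2)

# Sample data from the question, held in memory instead of a file (assumption)
sample = "item1,item2\nA,B\nB,C\nC,D\nE,F\n"

# Row-by-row comparison: no row has two equal cells, so nothing matches
row_matches = [row for row in csv.reader(io.StringIO(sample)) if row[0] == row[1]]
print(row_matches)           # []

# Column-wise comparison: B and C occur in both columns, on different rows
print(common_items(sample))  # ['B', 'C']
```

B sits in item2 on the first data row and in item1 on the second, so only a comparison across the whole columns can see it.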

Upvotes: 1

Ollie

Reputation: 1702

You need to:

  1. Use '\t' as your delimiter if your file is delimited by tabs rather than commas (for a comma-delimited file like the sample shown, the default delimiter works)
  2. Collect the items from each column into a set, then take the intersection of the two sets
  3. Print them

Here's my implementation:

import csv
with open('output/id.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')

    items_in_1 = set()
    items_in_2 = set()

    for line in csvreader:
        items_in_1.add(line[0])
        items_in_2.add(line[1])

    items_in_both = items_in_1.intersection(items_in_2)

    print("item")
    for item in sorted(items_in_both):  # sets are unordered, so sort for stable output
        print(item)
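
If you are not sure whether the file uses tabs or commas, one option is to let the standard library guess. This is only a sketch: the helper name read_rows and the in-memory sample text are assumptions for illustration, and csv.Sniffer is a heuristic that can raise csv.Error when it cannot decide.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text, guessing between comma and tab as the delimiter."""
    # Restrict the guess to the two candidates discussed here
    dialect = csv.Sniffer().sniff(text, delimiters=",\t")
    return list(csv.reader(io.StringIO(text), dialect))

# The question's sample in both flavours (in-memory stand-ins for the file)
comma_text = "item1,item2\nA,B\nB,C\nC,D\nE,F\n"
tab_text = comma_text.replace(",", "\t")

# Both variants parse into the same rows
print(read_rows(comma_text) == read_rows(tab_text))  # True
```

For a real file you would read a sample of its contents, sniff that, then rewind and pass the detected dialect to csv.reader.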

Upvotes: 2

JahKnows

Reputation: 2706

I would recommend using the pandas library, which loads your CSV file into a convenient DataFrame structure.

import pandas as pd

df = pd.read_csv(filename)

Then you can get the values common to both columns (named col1 and col2 here; in the question's file they are item1 and item2) by doing

set(df['col1']) & set(df['col2'])

To get the output shaped the way you describe, you can then build a new DataFrame from the intersection:

df2 = pd.DataFrame(data = {'item': list(set(df['col1']) & set(df['col2']))})

For example,

import pandas as pd
d = {'col1': [1, 2, 6, 4, 3], 'col2': [3, 2, 5, 6, 8]}
df = pd.DataFrame(data=d)
set(df['col1']) & set(df['col2'])

{2, 3, 6}

Upvotes: 2
