Reputation: 211
I have a CSV file like:
item1,item2
A,B
B,C
C,D
E,F
I want to compare these two columns and find the values that appear in both item1 and item2. The output should look like this:
item
B
C
I have tried this code:
import csv
with open('output/id.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for line in csvreader:
        if line[0] == line[1]:
            print(line)
        else:
            print("not match")
I am new to programming and don't know what the logic should be or how to solve this problem. Please help.
Upvotes: 0
Views: 12297
Reputation: 140168
You cannot succeed by comparing row by row; you have to work on the columns.
Read both columns of your CSV file (without the title row) into two Python sets, then compute the sorted intersection and write it to another CSV file:
import csv
with open("test.csv") as f:
    cr = csv.reader(f)
    next(cr)  # skip title
    col1 = set()
    col2 = set()
    for a, b in cr:
        col1.add(a)
        col2.add(b)
with open("output.csv", "w", newline="") as f:
    cw = csv.writer(f)
    cw.writerow(["item"])
    # wrap each value in a list so a multi-character value stays in one field
    cw.writerows([item] for item in sorted(col1 & col2))
With test.csv as:
item1,item2
A,B
B,C
C,D
E,F
you get
item
B
C
Note: if your CSV file has more than 2 columns, unpacking each row into a,b fails (it raises a ValueError). Index the row instead:
for row in cr:
    col1.add(row[0])
    col2.add(row[1])
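When the file has extra columns, another option is to pick the two columns by header name rather than by position. This is only a minimal sketch: it assumes the header names item1 and item2 from the question, and the function name common_items and its parameters are made up for illustration.

```python
import csv

def common_items(path, first="item1", second="item2"):
    """Return the sorted values that occur in both named columns."""
    col1, col2 = set(), set()
    with open(path, newline="") as f:
        # DictReader uses the first row of the file as the column names
        for row in csv.DictReader(f):
            col1.add(row[first])
            col2.add(row[second])
    return sorted(col1 & col2)
```

Any other columns in the file are simply ignored, so the row length no longer matters.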
Upvotes: 1
Reputation: 1702
You need to use '\t' as your delimiter, as your file is delimited by tabs, not commas. Here's my implementation:
import csv
with open('output/id.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    items_in_1 = set()
    items_in_2 = set()
    for line in csvreader:
        items_in_1.add(line[0])
        items_in_2.add(line[1])
items_in_both = items_in_1.intersection(items_in_2)
print("item")
for item in items_in_both:
    print(item)
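Sets have no defined iteration order, so the loop above may print the matches in any order. A possible variation, sketched under the assumption that the first row is a header and that short or blank rows should be skipped, sorts the result and takes the delimiter as a parameter. The helper name items_in_both and the sample data are illustrative only.

```python
import csv

def items_in_both(lines, delimiter="\t"):
    """Return values found in both of the first two columns, sorted."""
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader, None)  # skip the header row (assumed to be present)
    first, second = set(), set()
    for row in reader:
        if len(row) < 2:  # ignore blank or short rows
            continue
        first.add(row[0])
        second.add(row[1])
    return sorted(first & second)

# csv.reader accepts any iterable of strings, so a list works as well as a file
sample = ["item1\titem2", "A\tB", "B\tC", "C\tD", "E\tF"]
print(items_in_both(sample))  # ['B', 'C']
```

Passing delimiter="," handles a comma-separated file with the same function.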
Upvotes: 2
Reputation: 2706
I would recommend using the pandas library, which loads your CSV file into a convenient DataFrame data structure.
import pandas as pd
df = pd.read_csv(filename)
Then you can get the similarities between both columns by doing
set(df['col1']) & set(df['col2'])
To get the output shaped the way you describe you can then make a new DataFrame with this intersected information as
df2 = pd.DataFrame(data = {'item': list(set(df['col1']) & set(df['col2']))})
For example,
import pandas as pd
d = {'col1': [1, 2, 6, 4, 3], 'col2': [3, 2, 5, 6, 8]}
df = pd.DataFrame(data=d)
set(df['col1']) & set(df['col2'])
{2, 3, 6}
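Because a set is unordered, the intersection above does not keep the order of the rows. If the original order matters, one alternative (a different technique from the set intersection, shown only as a sketch) is to filter the first column with Series.isin. The column names item1 and item2 are taken from the question, and the data is built inline rather than read from a file.

```python
import pandas as pd

df = pd.DataFrame({"item1": ["A", "B", "C", "E"],
                   "item2": ["B", "C", "D", "F"]})

# True for each item1 value that also occurs somewhere in item2
mask = df["item1"].isin(df["item2"])

# Keep the matches in row order, drop repeats, and name the column "item"
common = df.loc[mask, "item1"].drop_duplicates().reset_index(drop=True)
result = common.to_frame(name="item")
print(result["item"].tolist())  # ['B', 'C']
```

The result can then be written out with result.to_csv("output.csv", index=False) if a file is needed.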
Upvotes: 2