Reputation: 31
I want to compare columns in two CSV files. Basically, check whether any value in a column of one file exists in a column of the other, and if so, print out any such values.
Ex: file1:

| id | value |
|---|---|
| abc | 789 |
| efg | 766 |
| hij | 456 |

file2:

| id | value |
|---|---|
| klm | 789 |
| nop | 766 |
| abc | 456 |
I need to check whether any values in file2's 'id' column exist in file1's 'id' column. In the example above, 'abc' is one value that appears in both and needs to be printed out.
Is there a bash script that can do this?
Upvotes: 0
Views: 1147
Reputation: 37394
Using join (plus tail, sort, and Bash's process substitution):
$ join -j 1 -o "1.1" <(tail -n +2 file1 | sort) <(tail -n +2 file2 | sort)
abc
Explained:

- join -j 1 -o "1.1" — join on the first space-separated field, output the first field of the first file
- <(...) — Bash's process substitution
- tail -n +2 file1 — ditch the header
- | sort — join expects the files to be sorted

(Yeah, I'd use @RamanSailopal's awk solution, too, ++)
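As a runnable sketch, the sample data from the question can be recreated and fed through the pipeline (this assumes bash, since <(...) is not available in POSIX sh):

```shell
# Recreate the pipe-delimited sample data, header line included.
printf 'id | value |\nabc | 789 |\nefg | 766 |\nhij | 456 |\n' > file1
printf 'id | value |\nklm | 789 |\nnop | 766 |\nabc | 456 |\n' > file2

# Drop each header, sort, then join on the first whitespace-separated field,
# printing only that field from the first file.
join -j 1 -o "1.1" <(tail -n +2 file1 | sort) <(tail -n +2 file2 | sort)
```

With this data the command prints abc, the only id common to both files.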
Upvotes: 0
Reputation: 12867
Using awk:
awk -F, 'FNR==1 { next } NR==FNR { map[$1]=$2;next } map[$1]!="" { print;print $1"\t"map[$1] } ' file1 file2
If the line number within the current file is 1 (FNR==1), skip to the next line; this drops each file's header. While processing the first file (NR==FNR), build an array map with the first comma-separated field as the index and the second field as the value. Then, when processing the second file, if there is an entry for the first field in map, print the line along with the entry in the map array.
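A quick way to try this is with comma-separated versions of the question's sample data (the file names file1.csv and file2.csv are placeholders):

```shell
# Recreate the question's data as CSV; 'abc' is in the id column of both files.
printf 'id,value\nabc,789\nefg,766\nhij,456\n' > file1.csv
printf 'id,value\nklm,789\nnop,766\nabc,456\n' > file2.csv

# For each id in file2 that also exists in file1, print file2's line,
# then the id together with its value from file1.
awk -F, 'FNR==1 { next } NR==FNR { map[$1]=$2; next } map[$1]!="" { print; print $1"\t"map[$1] }' file1.csv file2.csv
```

Here this prints the matching file2 line (abc,456) followed by abc and its file1 value (789), tab-separated.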
Upvotes: 1
Reputation: 93
If you are using Python, you can use the pandas library. For example, assuming both files have a header row and an 'id' column as in the question, you can concatenate the two id columns and let duplicated() flag the values that appear in both:

import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
ids = pd.concat([df1['id'], df2['id']])
duplicate = ids[ids.duplicated()]
print(duplicate)
For more detailed info, you can check this page.
Upvotes: 0