krishna
krishna

Reputation: 31

check if any value of a column in csv file exists in a column in a second csv file

I want to compare columns in two csv files. Basically check if any value in one column exists in another column. If they do exist print out any such values.

Ex: file1:

id value
abc 789
efg 766
hij 456

file2:

id value
klm 789
nop 766
abc 456

I need to compare if any values in file2 'id' column exist in file1 'id' column. In the example above 'abc' is one value that is repeated and needs to be print out.

Is there a bash script that can do this?

Upvotes: 0

Views: 1147

Answers (3)

James Brown
James Brown

Reputation: 37394

Using join (and tail and sort and Bash's process substitution):

$ join -j 1 -o "1.1" <(tail -n +2 file1 | sort) <(tail -n +2 file2 | sort)
abc

Explained:

  • join -j 1 -o "1.1" join on the first space-separated field, output the first field of the first file
  • <(...) Bash's process substitution
  • tail -n +2 file1 ditch the header
  • | sort join expects the files to be sorted

(Yeah, I'd use @RamanSailopal's awk solution, too, ++)

Upvotes: 0

Raman Sailopal
Raman Sailopal

Reputation: 12867

Using awk:

awk -F, 'FNR==1 { next } NR==FNR { map[$1]=$2;next } map[$1]!="" { print;print $1"\t"map[$1] } ' file1 file2

If the line number is 1 (FNR==1), skip to the next line. When processing the first file (NR=FNR), create an array map with the first space separated field as the index and the second field the value. Then, when processing the second file, if there is an entry for the first field in map, print the line along with the entry in the map array.

Upvotes: 1

Burhan Bilen
Burhan Bilen

Reputation: 93

If you are using python, you can use pandas library.

import pandas as pd

df = pd.DataFrame(yourdata, columns = ['X', 'Y', 'Z']) 
duplicate = df[df.duplicated()]

print(duplicate)

For more detailed info, you can check this page.

Upvotes: 0

Related Questions