Reputation: 31
I want to compare columns in two CSV files. Basically, check whether any value in a column of one file exists in a column of the other, and if so, print out any such values.
Ex: file1:

| id | value |
|---|---|
| abc | 789 |
| efg | 766 |
| hij | 456 |

file2:

| id | value |
|---|---|
| klm | 789 |
| nop | 766 |
| abc | 456 |
I need to check whether any values in file2's 'id' column exist in file1's 'id' column. In the example above, 'abc' is one value that appears in both and needs to be printed out.
Is there a bash script that can do this?
Upvotes: 0
Views: 1147
Reputation: 37394
Using join (plus tail, sort, and Bash's process substitution):
$ join -j 1 -o "1.1" <(tail -n +2 file1 | sort) <(tail -n +2 file2 | sort)
abc
Explained:

- join -j 1 -o "1.1" — join on the first space-separated field, output the first field of the first file
- <(...) — Bash's process substitution
- tail -n +2 file1 — ditch the header
- | sort — join expects the files to be sorted

(Yeah, I'd use @RamanSailopal's awk solution, too, ++)
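As a runnable sketch, the sample data from the question can be recreated and fed through the pipeline (this assumes bash, since <(...) is not available in POSIX sh):

```shell
# Recreate the pipe-delimited sample data, header line included.
printf 'id | value |\nabc | 789 |\nefg | 766 |\nhij | 456 |\n' > file1
printf 'id | value |\nklm | 789 |\nnop | 766 |\nabc | 456 |\n' > file2

# Drop each header, sort, then join on the first whitespace-separated field,
# printing only that field from the first file.
join -j 1 -o "1.1" <(tail -n +2 file1 | sort) <(tail -n +2 file2 | sort)
```

With this data the command prints abc, the only id common to both files.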
Upvotes: 0
Reputation: 12867
Using awk:
awk -F, 'FNR==1 { next } NR==FNR { map[$1]=$2;next } map[$1]!="" { print;print $1"\t"map[$1] } ' file1 file2
If the line number within the current file is 1 (FNR==1), skip to the next line; this drops each file's header. While processing the first file (NR==FNR), build an array map with the first comma-separated field as the index and the second field as the value. Then, when processing the second file, if there is an entry for the first field in map, print the line along with the entry in the map array.
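A quick way to try this is with comma-separated versions of the question's sample data (the file names file1.csv and file2.csv are placeholders):

```shell
# Recreate the question's data as CSV; 'abc' is in the id column of both files.
printf 'id,value\nabc,789\nefg,766\nhij,456\n' > file1.csv
printf 'id,value\nklm,789\nnop,766\nabc,456\n' > file2.csv

# For each id in file2 that also exists in file1, print file2's line,
# then the id together with its value from file1.
awk -F, 'FNR==1 { next } NR==FNR { map[$1]=$2; next } map[$1]!="" { print; print $1"\t"map[$1] }' file1.csv file2.csv
```

Here this prints the matching file2 line (abc,456) followed by abc and its file1 value (789), tab-separated.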
Upvotes: 1
Reputation: 93
If you are using Python, you can use the pandas library. For example, assuming both files have a header row and an 'id' column as in the question, you can concatenate the two id columns and let duplicated() flag the values that appear in both:

import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
ids = pd.concat([df1['id'], df2['id']])
duplicate = ids[ids.duplicated()]
print(duplicate)
For more detailed info, you can check this page.
Upvotes: 0