Angelo

Reputation: 5059

Comparing values in two files

I am comparing two files, each with a single column and an arbitrary number of rows.

file 1

vincy
alex
robin

file 2

Allen
Alex
Aaron
ralph
robin

If a value from file 1 is present in file 2, the output should show 1 for it, otherwise 0, in a tab-separated file.

Something like this

vincy 0
alex 1
robin 1

What I am doing is

#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done

The above code is not giving me the output I am looking for.

Kindly have a look and suggest a correction.

Thank you

Upvotes: 2

Views: 292

Answers (6)

Dennis Williamson

Reputation: 360055

AWK loves to do this kind of thing.

awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1

Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.

When FNR (the record number in the current file) and NR (the record number across all records read so far) are equal, the first file is the one being processed. Simply referencing an array element brings it into existence, and that is what sets up the dictionary. The next statement then skips the remaining rules and moves on to the next record.

Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
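
To see the two phases in action, here is a minimal sketch that runs the same idiom on the sample data from the question. The scratch directory, the array name dict, and the use of OFS to get the tab-separated output the question asks for are my own choices for illustration, not part of the one-liner above.

```shell
# Hypothetical scratch directory holding the sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' vincy alex robin > "$tmp/file1"
printf '%s\n' Allen Alex Aaron ralph robin > "$tmp/file2"

# Phase 1 (FNR == NR, i.e. file2): store each lowercased name as a key.
# Phase 2 (file1): print the name, a tab, and 1 if the key exists, else 0.
result=$(awk -v OFS='\t' '
    FNR == NR { dict[tolower($1)]; next }
    { print $1, ((tolower($1) in dict) ? 1 : 0) }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data this prints vincy with 0, then alex and robin with 1, each pair separated by a tab.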

Upvotes: 2

William Pursell

Reputation: 212248

The simple awk solution:

awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1

A simple explanation: for the lines in file2, NR==FNR, so the first action runs and we simply record that the line has been seen. For the lines in file1, the second action runs: the line is printed, followed by a space and then a 0 or a 1, depending on whether the line was seen in file2.
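
The "+ 0" is what turns a missing entry into a printable 0. A small sketch on the question's sample data, assuming a scratch directory and a tab separator of my own choosing, shows the difference. Note that this lookup is case-sensitive, so alex does not match Alex.

```shell
# Hypothetical scratch directory holding the sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' vincy alex robin > "$tmp/file1"
printf '%s\n' Allen Alex Aaron ralph robin > "$tmp/file2"

# With "+ 0": an unseen line yields an empty value that is coerced to the number 0.
with_coercion=$(awk 'NR==FNR { seen[$0]=1; next } { print $0 "\t" seen[$0] + 0 }' \
    "$tmp/file2" "$tmp/file1")

# Without "+ 0": an unseen line prints an empty second column.
without_coercion=$(awk 'NR==FNR { seen[$0]=1; next } { print $0 "\t" seen[$0] }' \
    "$tmp/file2" "$tmp/file1")

printf '%s\n' "$with_coercion"
rm -rf "$tmp"
```

If case should be ignored, wrapping both uses of $0 in tolower() is one way to get there.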

Upvotes: 2

wap26

Reputation: 2290

Another solution, if you have Python installed and are comfortable with it. Adjust the output formatting as needed.

#!/usr/bin/env python3
# Read both files, dropping trailing newlines so lines compare cleanly.
with open('file1') as f:
    f1 = f.read().splitlines()
with open('file2') as f:
    f2 = set(f.read().splitlines())
# Print each file1 line with 1 if it also appears in file2, else 0.
for line in f1:
    print(line, int(line in f2), sep='\t')

Upvotes: 1

Charles Duffy

Reputation: 295403

The comm command exists to do this kind of comparison for you.

The following approach does only one pass and scales well to very large input lists:

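#!/bin/bash
# comm -2 drops the lines unique to file2. Lines unique to file1 come out
# unindented; lines common to both files are prefixed with one tab.
while IFS= read -r; do
        if [[ $REPLY = $'\t'* ]] ; then
                printf '%s\t1\n' "${REPLY#?}"
        else
                printf '%s\t0\n' "${REPLY}"
        fi
done < <(comm -2 <(tr '[:upper:]' '[:lower:]' <file1 | sort) <(tr '[:upper:]' '[:lower:]' <file2 | sort))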

See also BashFAQ #36, which is directly on-point.
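
To make the column layout of comm -2 concrete, here is a rough sketch on the question's sample data. The scratch directory, the file names a and b, and the awk step that turns the layout into flags are assumptions of mine for illustration. Lines found in both inputs arrive with a leading tab, lines found only in the first input arrive unindented.

```shell
# Use one collation for both sort and comm.
LC_ALL=C; export LC_ALL

# Hypothetical scratch directory with lowercased, sorted copies of the sample data.
tmp=$(mktemp -d)
printf '%s\n' vincy alex robin | tr '[:upper:]' '[:lower:]' | sort > "$tmp/a"
printf '%s\n' Allen Alex Aaron ralph robin | tr '[:upper:]' '[:lower:]' | sort > "$tmp/b"

# Column 2 (lines only in b) is suppressed; column 3 (common) keeps one leading tab.
layout=$(comm -2 "$tmp/a" "$tmp/b")

# An empty first field means the line was tab-indented, i.e. present in both inputs.
flags=$(printf '%s\n' "$layout" |
    awk -F'\t' -v OFS='\t' '{ if ($1 == "") print $2, 1; else print $1, 0 }')

printf '%s\n' "$flags"
rm -rf "$tmp"
```

The result comes out in sorted order rather than in the original order of file1, which is the price of using comm.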

Upvotes: 1

kojiro

Reputation: 77107

There are several decent approaches. You can simply use line-by-line set math:

{
    grep -xF -f file2 file1 | sed $'s/$/\t1/'
    grep -vxF -f file2 file1 | sed $'s/$/\t0/'
} > somefile.txt

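Another approach is to combine the files and use uniq -c, then move the count column to the end with something like awk. Note that the count is then 1 or 2 rather than 0 or 1, and that lines found only in file2 are listed as well: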

sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
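
One possible way to get 0 and 1 out of the uniq -c idea is sketched below. Feeding file1 in twice is a trick of my own, not part of the answer above, and it assumes neither file contains duplicate lines: a name only in file1 is then counted 2 times, a name in both files 3 times, and a name only in file2 once. The scratch directory and the lowercasing step are also assumptions for illustration.

```shell
# Use one collation so sort and uniq agree on line order.
LC_ALL=C; export LC_ALL

# Hypothetical scratch directory holding the sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' vincy alex robin > "$tmp/file1"
printf '%s\n' Allen Alex Aaron ralph robin > "$tmp/file2"

# Count 1 = only in file2 (skipped), 2 = only in file1 (flag 0), 3 = in both (flag 1).
flags=$(cat "$tmp/file1" "$tmp/file1" "$tmp/file2" |
    tr '[:upper:]' '[:lower:]' | sort | uniq -c |
    awk '$1 >= 2 { print $2 "\t" ($1 - 2) }')

printf '%s\n' "$flags"
rm -rf "$tmp"
```

As with any sort-based approach, the output is in sorted order, not in the order of file1.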

Upvotes: 1

wap26

Reputation: 2290

The following code should do it.

Take a close look at the BEGIN and END sections.

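#!/bin/bash
rm -f binary
# Read file1 line by line and hand each value to awk with -v
# instead of splicing it into the awk program text.
while IFS= read -r i; do
     awk -v name="$i" 'BEGIN { isthere = 0 } $1 == name { isthere = 1 } END { print name "\t" isthere }' < file2 >> binary
done < file1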

Upvotes: 1
