gyrous

Reputation: 209

How can I compare two text files with multiple fields in Unix?

I have two text files that I want to compare. If the four columns of file2 are present in file1 and equal, I want output like this:

7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

I have this nawk one-liner:

nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt

It works well for comparing a single column between the two files, but I want to compare multiple columns. Does anyone have a suggestion?

Upvotes: 1

Views: 14911

Answers (7)

Jeff Burdges

Reputation: 4261

The statistical package R handles processing multiple CSV tables really easily. See An Introduction to R or R for Beginners.

Upvotes: 0

SiegeX

Reputation: 140327

This awk one-liner works for multiple columns on unsorted files:

awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt
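
(The NR==FNR condition is true only while awk is reading the first file, so the first block builds a lookup table of the four key columns from file1.txt; the bare pattern then lets through only those file2.txt lines whose columns 1, 3, 6 and 7 were seen as a key.)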

In order for this to work, it is imperative that the first file used for input (file1.txt in my example) be the file that only has 4 fields, like so:

file1.txt

7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1

file2.txt

7000,john,2,0,0,1,6
7000,john,2,0,0,1,7
7000,john,2,0,0,1,8
7000,john,2,0,0,1,9
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
7003,mike,1,0,0,2,2
7003,mike,1,0,0,2,3
7003,mike,1,0,0,2,4
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7

Output

$ awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

Alternatively, you could use the following syntax, which more closely matches the one in your question but is less readable, IMHO:

awk -F, 'NR==FNR{a[$1,$2,$3,$4];next} ($1SUBSEP$3SUBSEP$6SUBSEP$7 in a)' file1.txt file2.txt
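
(SUBSEP is the character awk inserts between the parts of a comma-separated array subscript, so $1SUBSEP$3SUBSEP$6SUBSEP$7 spells out by hand the same key that a[$1,$3,$6,$7] builds implicitly.)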

Upvotes: 3

pdehaan

Reputation: 342

It's not an elegant one-liner, but you could do it with Perl.

#!/usr/bin/perl
use strict;
use warnings;

# First argument: the seven-column file. Index every row by its first
# field; a key may occur on several rows, so keep a list of rows per key.
my %k;
open my $fh_a, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (my $line = <$fh_a>) {
    chomp $line;
    my @f = split /,/, $line;
    push @{ $k{$f[0]} }, \@f;
}
close $fh_a;

# Second argument: the four-column file. Print every indexed row whose
# columns 3, 6 and 7 match columns 2, 3 and 4 of the current line.
open my $fh_b, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
while (my $line = <$fh_b>) {
    chomp $line;
    my @f = split /,/, $line;
    for my $row (@{ $k{$f[0]} || [] }) {
        print join(',', @$row), "\n"
            if $row->[2] eq $f[1] && $row->[5] eq $f[2] && $row->[6] eq $f[3];
    }
}
close $fh_b;
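
Saved as, say, compare.pl (the name is mine), it would be run with the seven-column file as the first argument and the four-column file as the second.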

Upvotes: 1

Steve Weet

Reputation: 28392

This is neither efficient nor pretty, but it will get the job done. It is not the most efficient implementation, as it parses file1 multiple times; however, it does not read the entire file into RAM either, so it has some advantages over the simple scripting approaches.

sed -n '2,$p' file1 | awk -F, '{print $1 "," $3 "," $6 "," $7 " " $0 }' | \
sort | join file2 - |awk '{print $2}'

This works as follows:

  1. sed -n '2,$p' file1 sends file1 to STDOUT without the header line.
  2. The first awk command prints the four "key" fields from file1, in the same format as they appear in file2, followed by a space and then the entire original line from file1.
  3. The sort command puts file1's records into the same order as file2's.
  4. The join command joins file2 with the pipeline output (-), writing only the records that have a matching record in file2.
  5. The final awk command prints just the original line from file1.

In order for this to work you must ensure that file2 is sorted before running the command.
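
For example, if a plain sort of file2 suffices, the whole thing could be run like this (a sketch; file1 is the seven-column file and file2 the four-column file, as in the question):

sort file2 > file2.sorted
sed -n '2,$p' file1 | awk -F, '{print $1 "," $3 "," $6 "," $7 " " $0 }' | \
sort | join file2.sorted - | awk '{print $2}'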

Running this against your example data gave the following result

7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

EDIT

I note from your comments that you are getting a sorting error. If this error occurs when sorting file2 before running the pipeline command, then you could split the file, sort each part, and then cat them back together again.

Something like this would do that for you

mv file2 file2.orig
for i in 0 1 2 3 4 5 6 7 8 9
do
  grep "^${i}" file2.orig |sort > file2.$i
done
cat file2.[0-9] >file2
rm file2.[0-9] file2.orig

You may need to modify the list passed to the for loop if your file is not distributed evenly across the full range of leading digits.

Upvotes: 0

ninjalj

Reputation: 43688

Not really well tested, but this might work:

join -t, file1 file2 | awk -F, 'BEGIN{OFS=","} {if ($3==$8 && $6==$9 && $7==$10) print $1,$2,$3,$4,$5,$6,$7}'

(Of course, this assumes the input files are sorted).
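
If they are not, a bash process substitution could sort them on the fly (a sketch along the same lines, not tested beyond the sample data):

join -t, <(sort -t, -k1,1 file1) <(sort -t, -k1,1 file2) | awk -F, 'BEGIN{OFS=","} {if ($3==$8 && $6==$9 && $7==$10) print $1,$2,$3,$4,$5,$6,$7}'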

Upvotes: 0

Roman Cheplyaka

Reputation: 38718

TxtSushi looks like what you want. It lets you work with CSV files using SQL.

Upvotes: 1

Carl Smotricz

Reputation: 67760

Quick answer: Use cut to split out the fields you need and diff to compare the results.
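
A rough sketch of that idea, assuming bash and the question's layout (file1 with seven columns, file2 with four); diff then reports which key rows differ between the two files and prints nothing if they all match:

diff <(cut -d, -f1,3,6,7 file1 | sort) <(sort file2)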

Upvotes: 0
