Vivek Nair

Reputation: 21

Find lines existing in one file and not in another, based on a portion of the line

I have two files A.dat and B.dat.

A.dat

112381550RSAP002839002C00000000020200600000110102020-05-26
112539961RSAP002839002C00000000020200700000140102020-05-26
140823748RSAP002839002C00000000020210200000050102020-05-26
110604754RSAP002839002C00000000020200600000110102020-05-26

B.dat

112381550RSAP002839002C00000000020200600000000102020-05-26
112539961RSAP002839002C00000000020200700000000102020-05-26
119A06559RSAP002839002C00000000020210100000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
118372226RSAP002839002C00000000020200800000000102020-05-26

I want to find the records in B.dat that do not exist in A.dat, comparing only the first 22 characters of each line. The expected output is:

119A06559RSAP002839002C00000000020210100000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
118372226RSAP002839002C00000000020200800000000102020-05-26

I tried using grep as follows:

grep -Fvxf B.dat A.dat > c.dat

But I couldn't find a way to compare only that portion of each line.

Upvotes: 2

Views: 107

Answers (4)

agc

Reputation: 8406

If the order of the output is unimportant, here's a grep-free method using bash, sort, and GNU uniq:

sort {A,A,B}.dat | uniq -uw 22

...or, without bash brace expansion (GNU uniq is still required, since -w is not a POSIX option):

sort A.dat A.dat B.dat | uniq -uw 22

Output of either method:

118372226RSAP002839002C00000000020200800000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
119A06559RSAP002839002C00000000020210100000000102020-05-26
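One caveat may be worth checking before relying on this: `uniq -u` discards every line whose 22-character key occurs more than once, so two B lines that share a key are both dropped even when that key is absent from A. The sketch below uses made-up files (`a.txt`, `b.txt` and their contents are hypothetical, and GNU `uniq` is assumed for `-w`) and compares the result with an awk filter on the same key:

```shell
#!/bin/sh
# Hypothetical sample data; the key is the first 22 characters of each line.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf '%s\n' '112381550RSAP002839002C-a' > a.txt
printf '%s\n' \
  '112381550RSAP002839002C-b' \
  '119231672RSAP002839002C-first' \
  '119231672RSAP002839002C-second' \
  '118372226RSAP002839002C-only' > b.txt

# sort/uniq approach: the two 119231672... lines share a key, so -u drops both.
uniq_result=$(LC_ALL=C sort a.txt a.txt b.txt | uniq -u -w 22)

# awk approach: keeps every b.txt line whose key is not in a.txt.
awk_result=$(awk 'FNR==NR { seen[substr($0,1,22)]; next }
                  !(substr($0,1,22) in seen)' a.txt b.txt)

printf 'uniq:\n%s\n\nawk:\n%s\n' "$uniq_result" "$awk_result"
```

If keys are known to be unique within B.dat, as in the question's sample, the two approaches agree apart from output order.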

Upvotes: 2

Thomas Kammeyer

Reputation: 4507

You can do this with just grep and colrm as follows (a filename of "-" is understood as stdin and you can use that with "-f"):

colrm 23 < A.dat | grep -F -v -f - B.dat

If you're not 100% sure those 22-character patterns will match only at the start of a line, prefix each line of colrm's output with '^' and drop the "-F" flag so that grep treats the patterns as anchored regular expressions (this assumes the data contains no regex metacharacters), like so:

colrm 23 < A.dat | sed -e 's/^/\^/;' | grep -v -f - B.dat

Upvotes: 2

RavinderSingh13

Reputation: 133458

Try the following:

awk 'FNR==NR{array[substr($0,1,22)];next} !(substr($0,1,22) in array)' A.dat B.dat

Explanation of the above:

awk '                             ##Start of the awk program.
FNR==NR{                          ##FNR==NR is true only while the first file (A.dat) is being read.
  array[substr($0,1,22)]          ##Create an array entry indexed by the first 22 characters of the current line.
  next                            ##Skip the remaining statements and move on to the next line.
}
!(substr($0,1,22) in array)       ##For B.dat: if the first 22 characters are NOT in array, print the current line.
'  A.dat B.dat                    ##Input file names.

Upvotes: 4

kvantour

Reputation: 26471

I would use the following method based on awk:

awk '{s=substr($0,1,22)}(FNR==NR){a[s];next}!(s in a)' A.dat B.dat

This always compares exactly the first 22 characters of each line.

It works as follows: every time a line is read, from either file, it stores the first 22 characters of that line in a string s. While the first file is being processed (FNR==NR), s is stored as an index of the array a. While the second file is being processed, the program checks whether s is an index of a and, if it is not, prints the line.
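As a rough sketch, the same logic can be run against the question's sample data in a scratch directory. The key width is passed in as `n`, a hypothetical variable name introduced here for illustration and not part of the original command:

```shell
#!/bin/sh
# Recreate the question's sample files in a temporary directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

cat > A.dat <<'EOF'
112381550RSAP002839002C00000000020200600000110102020-05-26
112539961RSAP002839002C00000000020200700000140102020-05-26
140823748RSAP002839002C00000000020210200000050102020-05-26
110604754RSAP002839002C00000000020200600000110102020-05-26
EOF

cat > B.dat <<'EOF'
112381550RSAP002839002C00000000020200600000000102020-05-26
112539961RSAP002839002C00000000020200700000000102020-05-26
119A06559RSAP002839002C00000000020210100000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
118372226RSAP002839002C00000000020200800000000102020-05-26
EOF

# n (assumed name) is the key width; lines of B.dat whose first n characters
# never appeared in A.dat are printed in their original order.
missing=$(awk -v n=22 '
  { s = substr($0, 1, n) }        # key of the current line
  FNR == NR { seen[s]; next }     # first file: remember the key
  !(s in seen)                    # second file: print if key is unknown
' A.dat B.dat)

printf '%s\n' "$missing"
```

Unlike the sort-based approach, this keeps the lines in B.dat order. Note that FNR==NR is also true for the second file when the first file is empty, so an empty A.dat would need separate handling.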

You could also attempt a grep-based solution, but depending on your input this could produce false positives (lines of B.dat that are wrongly filtered out):

cut -c1-22 A.dat | grep -vFf - B.dat

The problem is that the first 22 characters of a line of A.dat can match anywhere in a line of B.dat, not only in its first 22 characters.
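To make the false positive concrete, here is a small sketch with made-up data (the files `a.txt` and `b.txt` and their contents are hypothetical). The second line of `b.txt` has a key that is not in `a.txt` but happens to contain that key later in the line. The second command anchors each pattern with `^` via sed, which assumes the keys contain no regex metacharacters:

```shell
#!/bin/sh
# Hypothetical sample data; the key is the first 22 characters of each line.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf '%s\n' '112381550RSAP002839002C-a' > a.txt
printf '%s\n' \
  '112381550RSAP002839002C-b' \
  '999999999RSAP002839002C-112381550RSAP002839002' > b.txt

# Unanchored fixed-string match: the key from a.txt is found mid-line in the
# second b.txt line, so that line is filtered out as well.
loose=$(cut -c1-22 a.txt | grep -vFf - b.txt)

# Anchored regex match: each pattern only matches at the start of a line.
strict=$(cut -c1-22 a.txt | sed 's/^/^/' | grep -vf - b.txt)

printf 'unanchored: [%s]\nanchored:   [%s]\n' "$loose" "$strict"
```

Here the unanchored command prints nothing, while the anchored one keeps the line whose key starts with 999999999.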

Upvotes: 3
