Reputation: 21
I have two files A.dat and B.dat.
A.dat
112381550RSAP002839002C00000000020200600000110102020-05-26
112539961RSAP002839002C00000000020200700000140102020-05-26
140823748RSAP002839002C00000000020210200000050102020-05-26
110604754RSAP002839002C00000000020200600000110102020-05-26
B.dat
112381550RSAP002839002C00000000020200600000000102020-05-26
112539961RSAP002839002C00000000020200700000000102020-05-26
119A06559RSAP002839002C00000000020210100000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
118372226RSAP002839002C00000000020200800000000102020-05-26
I want to find the records in B.dat that do not exist in A.dat, comparing only the first 22 characters of each line. The output should be as below:
119A06559RSAP002839002C00000000020210100000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
118372226RSAP002839002C00000000020200800000000102020-05-26
I tried using grep like below:
grep -Fvxf B.dat A.dat > c.dat
But I didn't find a way to compare only that portion of the data.
Upvotes: 2
Views: 107
Reputation: 8406
If the order of the output is unimportant, here's a grep-free method using bash, sort, and GNU uniq:
sort {A,A,B}.dat | uniq -uw 22
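...or in a POSIX shell, without brace expansion (the -w option of uniq is still a GNU extension, not specified by POSIX):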
sort A.dat A.dat B.dat | uniq -uw 22
Output of either method:
118372226RSAP002839002C00000000020200800000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
119A06559RSAP002839002C00000000020210100000000102020-05-26
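Why A.dat is listed twice may not be obvious, so here is a minimal sketch on made-up data. The 4-character keys (AAAA, BBBB, DDDD) and the temporary files are assumptions for illustration only and stand in for the 22-character prefixes. It assumes GNU uniq, as above.

```shell
# Hypothetical sample data; a 4-character key stands in for the 22-character prefix.
tmp=$(mktemp -d)
printf '%s\n' 'AAAA-1' 'DDDD-1' > "$tmp/A.dat"   # DDDD exists only in A
printf '%s\n' 'AAAA-2' 'BBBB-2' > "$tmp/B.dat"   # BBBB exists only in B

# A listed once: keys that exist only in A (DDDD) are also unique, so they leak into the output.
once=$(sort "$tmp/A.dat" "$tmp/B.dat" | uniq -uw 4)

# A listed twice: every key from A occurs at least twice, so uniq -u drops it.
# Only keys that occur once overall, i.e. only in B, survive.
twice=$(sort "$tmp/A.dat" "$tmp/A.dat" "$tmp/B.dat" | uniq -uw 4)

printf 'A once:\n%s\n' "$once"
printf 'A twice:\n%s\n' "$twice"
rm -rf "$tmp"
```

One caveat of this approach: if B.dat itself contains two lines with the same key, uniq -u drops both of them, even when the key is absent from A.dat.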
Upvotes: 2
Reputation: 4507
You can do this with just grep and colrm as follows (a filename of "-" is understood as stdin and you can use that with "-f"):
colrm 23 < A.dat | grep -F -v -f - B.dat
If you're not 100% sure those 22-character patterns are going to match only at the starts of lines, you need to add a '^' to each line of output from colrm and elide the "-F" flag from grep's flags, like so:
colrm 23 < A.dat | sed -e 's/^/\^/;' | grep -v -f - B.dat
Upvotes: 2
Reputation: 133458
Could you please try the following.
awk 'FNR==NR{array[substr($0,1,22)];next} !(substr($0,1,22) in array)' A.dat B.dat
Explanation: a detailed, commented version of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR then do following.
array[substr($0,1,22)] ##Creating an array whose index is the first 22 characters of the current line.
next ##next will skip all further statements from here.
}
!(substr($0,1,22) in array) ##Checking condition: if the first 22 characters of the current line are NOT in array, then print the current line.
' A.dat B.dat ##Mentioning Input_file names here.
Upvotes: 4
Reputation: 26471
I would use the following method based on awk:
awk '{s=substr($0,1,22)}(FNR==NR){a[s];next}!(s in a)' A.dat B.dat
This ensures that only the first 22 characters of each line are compared.
It essentially does the following: every time a line is read (regardless of which file it comes from), it creates a little string s containing the first 22 characters of the line. If we are processing the first file (FNR==NR), it stores the string as a key in an array a. If we are processing the second file, it checks whether that string is a member of a and, if not, prints the line.
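If the prefix length might change, the same awk program can be wrapped in a small shell function. This is only a sketch: the name prefix_diff, the width argument, and the sample data with 4-character keys are assumptions for illustration and not part of the original answer.

```shell
# prefix_diff is a hypothetical helper name.
# Usage: prefix_diff WIDTH FILE_A FILE_B
# Prints the lines of FILE_B whose first WIDTH characters match no line of FILE_A.
prefix_diff() {
    awk -v n="$1" '{s=substr($0,1,n)} FNR==NR{a[s];next} !(s in a)' "$2" "$3"
}

# Hypothetical sample data; a 4-character key stands in for the 22-character prefix.
tmp=$(mktemp -d)
printf '%s\n' 'AAAA-x' 'BBBB-x' > "$tmp/A.dat"
printf '%s\n' 'BBBB-y' 'CCCC-y' 'AAAA-y' 'DDDD-y' > "$tmp/B.dat"

result=$(prefix_diff 4 "$tmp/A.dat" "$tmp/B.dat")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Unlike the sort-based method, this keeps the surviving lines in the order they appear in the second file. For the files in the question, the call would be prefix_diff 22 A.dat B.dat. Note that the FNR==NR idiom assumes the first file is not empty; with an empty first file, the second file is treated as the first and nothing is printed.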
You could also attempt a grep-based solution, but this could wrongly exclude lines, depending on your input:
cut -c1-22 A.dat | grep -vFf - B.dat
This, however, could match the first 22 characters of the lines of A.dat anywhere in the lines of B.dat (not necessarily in the first 22 characters).
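To make the difference concrete, here is a minimal sketch on made-up data. The 4-character keys and the line contents are assumptions for illustration only. The third line of the sample B.dat contains the key AAAA in the middle of the line rather than at the start.

```shell
# Hypothetical sample data; a 4-character key stands in for the 22-character prefix.
tmp=$(mktemp -d)
printf '%s\n' 'AAAA-from-a' > "$tmp/A.dat"
printf '%s\n' 'AAAA-in-both' 'BBBB-only-in-b' 'CCCC-has-AAAA-inside' > "$tmp/B.dat"

# Unanchored fixed-string match: the CCCC line is dropped too, because AAAA occurs mid-line.
unanchored=$(cut -c1-4 "$tmp/A.dat" | grep -vFf - "$tmp/B.dat")

# Prefix comparison with awk: only the line that starts with AAAA is dropped.
prefix_only=$(awk '{s=substr($0,1,4)} FNR==NR{a[s];next} !(s in a)' "$tmp/A.dat" "$tmp/B.dat")

printf 'grep -vFf:\n%s\n' "$unanchored"
printf 'awk prefix:\n%s\n' "$prefix_only"
rm -rf "$tmp"
```

With fixed-width records like those in the question, a 22-character key is unlikely to reappear later in a line, but the awk form does not depend on that.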
Upvotes: 3