Bash: find non-common rows based on a second column

Question

I have pairs of files that look like so:

File_1A.txt
SNP1 pos1
SNP2 pos2
SNP3 pos3
SNP4 pos4
SNP5 pos5
SNP7 pos7

File_1B.txt
SNP1 pos1
SNP2 pos2
SNP3 pos3
SNP5 pos5
SNP6 pos6
SNP7 pos7

More descriptions about these 2 files:

They share most but not all of their SNP IDs, i.e. the actual names of the SNPs may differ. For example, SNP1 may be called SNP1a in one and SNP1b in the other. This means I can't compare the files based on column 1. I need to use column 2.
The values (they are numbers in my file) in column 2 are unique - i.e. there are no duplicates within each file.

Based on column 2, I want to find the rows:

- that are present in File_1A.txt but not in File_1B.txt
- that are present in File_1B.txt but not in File_1A.txt

In this example, my output would be:

SNP4 pos4
SNP6 pos6

I've been looking at commands such as diff, but they only report the rows that are missing from one file compared to the other. How can I find the rows that are missing in either direction?

Many thanks.

EDIT: Apologies, to make things clearer, here is what my real files look like:

File_1A.txt

rs13339951:45007956:T:C 45007956
rs2838331 45026728
rs5647 12345

File_1B.txt

rs13339951 45007956
rs2838331 45026728
rs55778 1235597

From this file, I should get these rows only:

rs5647 12345
rs55778 1235597

RavinderSingh13 · Accepted Answer

If you don't need the output to keep the same order as the Input_file(s), the following may help.

awk 'FNR==NR{a[$0]=$0;next} !($0 in a){print;next} {delete a[$0]} END{for(i in a){print i}}' File_1A.txt File_1B.txt

Here is the same solution in multi-line form:

awk '
FNR==NR{
 a[$0]=$0;
 next
}
!($0 in a){
 print;
 next
}
{
 delete a[$0]
}
END{
 for(i in a){
   print i
 }
}
' File_1A.txt File_1B.txt

This prints every line that is present in File_1B.txt but not in File_1A.txt, and vice versa.
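As a quick check, the one-liner can be run against the first pair of sample files from the question. This is only a sketch: it recreates the files in a scratch directory made with mktemp, and the variable names tmpdir and result are placeholders chosen for the example.

```shell
# Recreate the first pair of sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'SNP1 pos1' 'SNP2 pos2' 'SNP3 pos3' 'SNP4 pos4' 'SNP5 pos5' 'SNP7 pos7' > "$tmpdir/File_1A.txt"
printf '%s\n' 'SNP1 pos1' 'SNP2 pos2' 'SNP3 pos3' 'SNP5 pos5' 'SNP6 pos6' 'SNP7 pos7' > "$tmpdir/File_1B.txt"

# Run the one-liner from the answer and capture its output.
result=$(awk 'FNR==NR{a[$0]=$0;next} !($0 in a){print;next} {delete a[$0]} END{for(i in a){print i}}' \
  "$tmpdir/File_1A.txt" "$tmpdir/File_1B.txt")

printf '%s\n' "$result"
# prints:
# SNP6 pos6
# SNP4 pos4

rm -rf "$tmpdir"
```

The line unique to File_1B.txt comes first because it is printed while that file is being read; the line unique to File_1A.txt is only printed in the END block.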

Explanation of code: FNR==NR is a condition that is TRUE only while the very first Input_file is being read. Both FNR and NR hold a line number, but FNR is RESET whenever awk starts to read the next file, whereas NR keeps increasing until all Input_file(s) have been read.

awk '
FNR==NR{                 ##This condition is TRUE only while the first Input_file, File_1A.txt, is being read.
 a[$0]=$0;               ##Create an array named a whose index and value are both the current line.
 next                    ##Skip all further statements for this line.
}
!($0 in a){              ##Check whether the current line is absent from array a; if so, enter this block.
 print;                  ##Print the current line of File_1B.txt, since it is not present in File_1A.txt.
 next                    ##Skip all further statements for this line.
}
{
 delete a[$0]            ##Otherwise the line is common to both files, so delete its entry from array a.
}
END{
 for(i in a){            ##Loop over all remaining elements of array a.
   print i               ##Print the index, i.e. a line that is present in File_1A.txt but NOT in File_1B.txt.
 }
}
' File_1A.txt File_1B.txt
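To see the FNR==NR idiom in isolation, the small sketch below prints both counters for every line of two tiny files. The file names first.txt and second.txt and the variable names are made up for the illustration.

```shell
# Two throwaway files: 2 lines in the first, 3 lines in the second.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\ne\n' > "$tmpdir/second.txt"

# Print the file name plus both counters for each input line.
# Running awk from inside tmpdir keeps FILENAME free of the directory path.
counters=$(cd "$tmpdir" && awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' first.txt second.txt)

printf '%s\n' "$counters"
# prints:
# first.txt NR=1 FNR=1
# first.txt NR=2 FNR=2
# second.txt NR=3 FNR=1
# second.txt NR=4 FNR=2
# second.txt NR=5 FNR=3

rm -rf "$tmpdir"
```

The two counters agree only on the lines of the first file. One caveat: if the first file is empty, FNR==NR is also true while the second file is read, so the idiom assumes a non-empty first file.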

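EDIT2: Since the OP changed the sample data, I have changed the code accordingly: the field separator is now a colon or a space, and lines are matched on the first field. I am keeping the previous code, as it may help people whose Input_file(s) look like the earlier samples.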

awk '
FNR==NR{                 ##This condition is TRUE only while the first Input_file, File_1A.txt, is being read.
 a[$1]=$0;               ##Create an array named a whose index is the first field and whose value is the current line.
 next                    ##Skip all further statements for this line.
}
!($1 in a){              ##Check whether the first field of the current line is absent from array a; if so, enter this block.
 print;                  ##Print the current line of File_1B.txt, since its first field is not present in File_1A.txt.
 next                    ##Skip all further statements for this line.
}
{
 delete a[$1]            ##Otherwise the first field is common to both files, so delete its entry from array a.
}
END{
 for(i in a){            ##Loop over all remaining elements of array a.
   print a[i]            ##Print the stored line, i.e. a line that is present in File_1A.txt but NOT in File_1B.txt.
 }
}
' FS=':| ' File_1A.txt File_1B.txt
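The code above matches on the first field after splitting on a colon or a space. Since the question asks for a comparison on column 2, a variant keyed on the second whitespace-separated field might look like the sketch below. It is not part of the original answer. It assumes the position is always the second field and, as the question states, is unique within each file. The array name seen and the shell variable names are illustrative.

```shell
# Recreate the "real" sample files from the question's EDIT in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'rs13339951:45007956:T:C 45007956' 'rs2838331 45026728' 'rs5647 12345' > "$tmpdir/File_1A.txt"
printf '%s\n' 'rs13339951 45007956' 'rs2838331 45026728' 'rs55778 1235597' > "$tmpdir/File_1B.txt"

unmatched=$(awk '
FNR==NR      { seen[$2] = $0; next }            # first file: remember each line under its column-2 value
!($2 in seen){ print; next }                    # second file: position unknown to the first file, print the line
             { delete seen[$2] }                # position found in both files, drop it
END          { for (k in seen) print seen[k] }  # whatever is left occurred only in the first file
' "$tmpdir/File_1A.txt" "$tmpdir/File_1B.txt")

printf '%s\n' "$unmatched"
# prints:
# rs55778 1235597
# rs5647 12345

rm -rf "$tmpdir"
```

Keying on $2 with the default field separator means the SNP name in column 1 is ignored entirely, so differing names for the same position do not matter.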
