mf94

Reputation: 479

Bash: find non-common rows based on a second column

I have pairs of files that look like so:

File_1A.txt
SNP1 pos1
SNP2 pos2
SNP3 pos3
SNP4 pos4
SNP5 pos5
SNP7 pos7

File_1B.txt
SNP1 pos1
SNP2 pos2
SNP3 pos3
SNP5 pos5
SNP6 pos6
SNP7 pos7

More details about these two files:

Based on column 2, I want to find the rows:

- that are present in File_1A.txt but not in File_1B.txt
- that are present in File_1B.txt but not in File_1A.txt

In this example, my output would be:

SNP4 pos4
SNP6 pos6

I've looked at commands such as diff, but they only report the rows that are missing from one file relative to the other. How can I find the rows that are missing from either file, in both directions?

Many thanks.

EDIT: Apologies. To make things clearer, here is what my real files look like:

File_1A.txt

rs13339951:45007956:T:C 45007956
rs2838331 45026728
rs5647 12345

File_1B.txt

rs13339951 45007956
rs2838331 45026728
rs55778 1235597

From these files, I should get only these rows:

rs5647 12345
rs55778 1235597

Upvotes: 1

Views: 94

Answers (2)

James Brown

Reputation: 37464

If there are no duplicates within each file, you could just:

$ awk '$2 in a{delete a[$2];next}{a[$2]=$0}END{for(i in a) print a[i]}' filea fileb
SNP6 pos6
SNP4 pos4

Explained:

$2 in a {           # if the 2nd-column value is already a key in a
    delete a[$2]    # it has a partner: delete it and skip to...
    next }          # ...the next record
{
    a[$2]=$0 }      # otherwise store the record, keyed by $2
END {               # after both files, only unpaired records remain in a
    for(i in a)     # iterate over them and
        print a[i]  # output them (in unspecified order)
}
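
To try the one-liner against the sample data from the question, a minimal sketch follows. The scratch directory, the `result` variable and the final `sort` are assumptions added here for illustration and are not part of the answer above; since `for(i in a)` visits keys in an unspecified order, sorting simply makes the output repeatable.

```shell
#!/bin/sh
# Recreate the two sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'SNP1 pos1' 'SNP2 pos2' 'SNP3 pos3' 'SNP4 pos4' 'SNP5 pos5' 'SNP7 pos7' > "$tmpdir/File_1A.txt"
printf '%s\n' 'SNP1 pos1' 'SNP2 pos2' 'SNP3 pos3' 'SNP5 pos5' 'SNP6 pos6' 'SNP7 pos7' > "$tmpdir/File_1B.txt"

# Keys that occur in both files cancel out; unpaired records remain in a.
# The sort is an addition for illustration only (stable output order).
result=$(awk '$2 in a{delete a[$2];next}{a[$2]=$0}END{for(i in a) print a[i]}' \
    "$tmpdir/File_1A.txt" "$tmpdir/File_1B.txt" | LC_ALL=C sort)

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because the approach toggles each key on and off, it relies on the stated condition that no column-2 value is repeated within a single file.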

Upvotes: 1

RavinderSingh13

Reputation: 133760

If the order of the output does not matter (i.e. it need not follow the order of the input files), then the following may help.

awk 'FNR==NR{a[$0]=$0;next} !($0 in a){print;next} {delete a[$0]} END{for(i in a){print i}}' File_1A.txt File_1B.txt

Here is the same solution in multi-line form.

awk '
FNR==NR{
 a[$0]=$0;
 next
}
!($0 in a){
 print;
 next
}
{
 delete a[$0]
}
END{
 for(i in a){
   print i
}
}
' File_1A.txt File_1B.txt

It prints all the lines that are present in File_1B.txt but NOT in File_1A.txt, and vice versa.

Explanation of the code: FNR==NR is a condition that is TRUE only while the very first input file is being read. Both FNR and NR hold a line number, but FNR is reset each time awk starts reading the next file, whereas NR keeps increasing until all input files have been read.
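
The difference between the two counters can be seen with a small sketch. The file names `first.txt` and `second.txt` and the `counters` variable are hypothetical, chosen only for this illustration.

```shell
#!/bin/sh
# Two throwaway input files (hypothetical names, for illustration only).
tmpdir=$(mktemp -d)
printf '%s\n' 'a1' 'a2' > "$tmpdir/first.txt"
printf '%s\n' 'b1' 'b2' 'b3' > "$tmpdir/second.txt"

# Print the per-file counter (FNR), the global counter (NR) and the line.
counters=$(awk '{print FNR, NR, $0}' "$tmpdir/first.txt" "$tmpdir/second.txt")

printf '%s\n' "$counters"
rm -rf "$tmpdir"
```

FNR and NR agree only on the lines of the first file, which is why FNR==NR selects it. One caveat: if the first file is empty, the condition also holds while the second file is read.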

awk '
FNR==NR{                 ##TRUE only while the first input file, File_1A.txt, is being read.
 a[$0]=$0;               ##Store the current line in array a, using the whole line as both index and value.
 next                    ##Skip the remaining rules and move on to the next line.
}
!($0 in a){              ##File_1B.txt only: TRUE if the current line is not an index of array a.
 print;                  ##Print the current line of File_1B.txt, since it is not present in File_1A.txt.
 next                    ##Skip the remaining rules and move on to the next line.
}
{
 delete a[$0]            ##Otherwise the line is common to both files, so delete its element from array a.
}
END{
 for(i in a){            ##Loop over the elements left in array a.
   print i               ##Print the index, i.e. a line that is present in File_1A.txt but NOT in File_1B.txt.
}
}
' File_1A.txt File_1B.txt

EDIT2: Since the OP changed the field separator, I have changed the code accordingly. I am keeping the previous code, as it may help people whose input files look like the earlier samples.

awk '
FNR==NR{                 ##TRUE only while the first input file, File_1A.txt, is being read.
 a[$1]=$0;               ##Store the current line in array a, indexed by its first field.
 next                    ##Skip the remaining rules and move on to the next line.
}
!($1 in a){              ##File_1B.txt only: TRUE if the first field is not an index of array a.
 print;                  ##Print the current line of File_1B.txt, since its first field does not occur in File_1A.txt.
 next                    ##Skip the remaining rules and move on to the next line.
}
{
 delete a[$1]            ##Otherwise the first field is common to both files, so delete its element from array a.
}
END{
 for(i in a){            ##Loop over the elements left in array a.
   print a[i]            ##Print the stored line, which is present in File_1A.txt but NOT in File_1B.txt.
}
}
' FS=':| ' File_1A.txt File_1B.txt
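
As a quick check, the sketch below runs the same logic in compact form against the real-data sample from the question. The scratch directory and the `result` variable are assumptions added for illustration. Setting `FS=':| '` makes awk split on either a colon or a space, so the first field of `rs13339951:45007956:T:C 45007956` is `rs13339951`.

```shell
#!/bin/sh
# Recreate the real-data sample from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'rs13339951:45007956:T:C 45007956' 'rs2838331 45026728' 'rs5647 12345' > "$tmpdir/File_1A.txt"
printf '%s\n' 'rs13339951 45007956' 'rs2838331 45026728' 'rs55778 1235597' > "$tmpdir/File_1B.txt"

# Same rules as above, written compactly; the key is the first field.
result=$(awk '
FNR==NR    { a[$1]=$0; next }   # first file: remember each line by its ID
!($1 in a) { print; next }      # second file: ID never seen in the first file
           { delete a[$1] }     # ID found in both files: drop it
END        { for(i in a) print a[i] }  # leftovers exist only in the first file
' FS=':| ' "$tmpdir/File_1A.txt" "$tmpdir/File_1B.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Lines found only in File_1B.txt are printed as they are read, and lines found only in File_1A.txt follow from the END block.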

Upvotes: 1
