Reputation: 23
I have two files named file1 and file2. I want to compare them and print the matching lines to another file. I tried awk and grep but didn't get a working solution. I also checked many answers to related questions, but none of them helped in my case.
File 1:
UDP-N-acetylenolpyruvoylglucosamine reductase K00075 murB; UDP-N-acetylmuramate dehydrogenase EC:1.3.1.98
UDP-N-acetylglucosamine 1-carboxyvinyltransferase K00790 murA; UDP-N-acetylglucosamine 1-carboxyvinyltransferase EC:2.5.1.7
UDP-N-acetylmuramate--L-alanine ligase K01924 murC; UDP-N-acetylmuramate--alanine ligase EC:6.3.2.8
File 2:
D ZPR_2530 UDP-N-acetylglucosamine 1-carboxyvinyltransferase K00790 murA; UDP-N-acetylglucosamine 1-carboxyvinyltransferase EC:2.5.1.7
D ZPR_3743 UDP-N-acetylenolpyruvoylglucosamine reductase K00075 murB; UDP-N-acetylmuramate dehydrogenase EC:1.3.1.98
D ZPR_3807 UDP-N-acetylmuramate--L-alanine ligase K01924 murC; UDP-N-acetylmuramate--alanine ligase EC:6.3.2.8
D ZPR_3810 UDP-N-acetylmuramoylalanine--D-glutamate ligase K01925 murD; UDP-N-acetylmuramoylalanine--D-glutamate ligase EC:6.3.2.9
D ZPR_3812 UDP-N-acetylmuramoylalanyl-D-glutamate--2 K01928 murE; UDP-N-acetylmuramoyl-L-alanyl-D-glutamate--2,6-diaminopimelate ligase EC:6.3.2.13
D ZPR_0820 D-alanyl-alanine synthetase A K01921 ddl; D-alanine-D-alanine ligase EC:6.3.2.4
D ZPR_3928 UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase K01929 murF; UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase EC:6.3.2.10
D ZPR_4441 putative undecaprenol kinase K06153 bacA; undecaprenyl-diphosphatase EC:3.6.1.27
D ZPR_3043 PAP2 superfamily membrane protein K19302 bcrC; undecaprenyl-diphosphatase EC:3.6.1.27
Expected output:
D ZPR_3743 UDP-N-acetylenolpyruvoylglucosamine reductase K00075 murB; UDP-N-acetylmuramate dehydrogenase EC:1.3.1.98
D ZPR_2530 UDP-N-acetylglucosamine 1-carboxyvinyltransferase K00790 murA; UDP-N-acetylglucosamine 1-carboxyvinyltransferase EC:2.5.1.7
D ZPR_3807 UDP-N-acetylmuramate--L-alanine ligase K01924 murC; UDP-N-acetylmuramate--alanine ligase EC:6.3.2.8
Command that I used:
awk 'NR==FNR{a[$1]=$NF;next;} {print ($0 ? a[$1] OFS $0 :$0)}' no-tab-file.txt Non-homo-Dzpr00001.txt
Also:
grep -Ff Non-homo-Dzpr00001.txt no-tab-file.txt
Upvotes: 0
Views: 131
Reputation: 133428
EDIT: After working through this with the OP, the following command is the one that worked for her.
awk 'FNR==NR{a[$0];next} {for(i in a){if(match($0,i)){print;next}}}' file1 file2
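For readers more comfortable with Python, the substring idea behind this one-liner can be sketched roughly as below. This is not a translation of the command above: it uses literal substring containment (`in`) where awk's `match` treats each file1 line as a regular expression. The function name `lines_containing` and the shortened sample data are made up for illustration.

```python
def lines_containing(patterns, lines):
    """Return each line that contains at least one pattern as a literal substring."""
    # Trim patterns and drop empty ones: an empty string is a substring of every line.
    wanted = [p.strip() for p in patterns if p.strip()]
    return [line for line in lines if any(p in line for p in wanted)]


# Shortened stand-ins for file1 and file2 (assumed data, not the real files).
file1 = [" reductase K00075 murB", " ligase K01924 murC", ""]
file2 = [
    "D ZPR_3743 reductase K00075 murB",
    "D ZPR_0820 synthetase K01921 ddl",
    "D ZPR_3807 ligase K01924 murC",
]

for line in lines_containing(file1, file2):
    print(line)
```

Because the comparison is literal, characters such as `.` or `(` in a file1 line are matched as themselves rather than as regex operators.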
Could you please try the following? It was written and tested with your shown samples only, and it assumes that the lines of your Input_file1 start with a space (as in the shown samples).
awk '
FNR==NR{
a[$0]
next
}
{
val=$0
sub(/ +/,"",val)
sub(/[^ ]*/,"",val)
}
val in a
' file1 file2
Explanation: a detailed, line-by-line explanation of the above code.
awk ' ##Start of the awk program.
FNR==NR{ ##FNR==NR is TRUE only while the first Input_file, file1, is being read.
a[$0] ##Create an entry in array a whose index is $0 (the current line).
next ##next skips all further statements for this line.
} ##End of the FNR==NR block.
{
val=$0 ##Copy the current line of file2 into variable val.
sub(/ +/,"",val) ##Remove the first run of spaces in val (sub changes only the first match).
sub(/[^ ]*/,"",val) ##Remove the leading non-space text; for the shown samples val is now the rest of the line, starting with a space.
}
val in a ##If val is an index of array a, print the current line (the default action).
' file1 file2 ##Input_file names.
Output will be as follows.
D ZPR_2530 UDP-N-acetylglucosamine 1-carboxyvinyltransferase K00790 murA; UDP-N-acetylglucosamine 1-carboxyvinyltransferase EC:2.5.1.7
D ZPR_3743 UDP-N-acetylenolpyruvoylglucosamine reductase K00075 murB; UDP-N-acetylmuramate dehydrogenase EC:1.3.1.98
D ZPR_3807 UDP-N-acetylmuramate--L-alanine ligase K01924 murC; UDP-N-acetylmuramate--alanine ligase EC:6.3.2.8
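To see what the two `sub` calls leave in `val`, here is a small Python imitation. It is only a sketch: `key_after_two_fields` is a hypothetical helper name, the sample line is shortened, and `re.sub(..., count=1)` is used to mimic awk's `sub`, which replaces only the first match.

```python
import re


def key_after_two_fields(line):
    """Mimic the two awk sub() calls; each replaces only the first match."""
    val = re.sub(r" +", "", line, count=1)    # drop the first run of spaces
    val = re.sub(r"[^ ]*", "", val, count=1)  # drop the leading non-space text
    return val


# Shortened stand-in for a file2 line (assumed data).
sample = "D ZPR_2530 UDP-N-acetylglucosamine 1-carboxyvinyltransferase K00790 murA"
print(repr(key_after_two_fields(sample)))
```

The result keeps the space that followed the second field, which is why the lookup only succeeds when the file1 lines also begin with a single space.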
Upvotes: 1
Reputation: 3612
Here's an example solution in Python, without shell tools:
with open('Non-homo-Dzpr00001.txt') as f:
    wanted = {line.strip() for line in f if line.strip()}  # file1 lines, whitespace-trimmed
with open('no-tab-file.txt') as f, open('file_3', 'w') as out:
    for line in f:
        parts = line.split(maxsplit=2)  # e.g. ['D', 'ZPR_2530', 'rest of line']
        # Keep the file2 line when the text after its first two fields is a file1 line
        if len(parts) == 3 and parts[2].strip() in wanted:
            out.write(line)
Upvotes: 0