Alina

Reputation: 23

Comparing two files and printing similar lines to another file

I have two files named file1 and file2. I want to compare both files and print the similar lines to another file. I tried awk and grep but didn't get a solution. I checked many answers related to my question, but they weren't helpful in my case.

File 1:

 UDP-N-acetylenolpyruvoylglucosamine reductase K00075 murB; UDP-N-acetylmuramate dehydrogenase EC:1.3.1.98
 UDP-N-acetylglucosamine 1-carboxyvinyltransferase K00790 murA; UDP-N-acetylglucosamine 1-carboxyvinyltransferase EC:2.5.1.7
 UDP-N-acetylmuramate--L-alanine ligase K01924 murC; UDP-N-acetylmuramate--alanine ligase EC:6.3.2.8

File 2:

D      ZPR_2530 UDP-N-acetylglucosamine 1-carboxyvinyltransferase K00790 murA; UDP-N-acetylglucosamine 1-carboxyvinyltransferase EC:2.5.1.7
D      ZPR_3743 UDP-N-acetylenolpyruvoylglucosamine reductase K00075 murB; UDP-N-acetylmuramate dehydrogenase EC:1.3.1.98
D      ZPR_3807 UDP-N-acetylmuramate--L-alanine ligase K01924 murC; UDP-N-acetylmuramate--alanine ligase EC:6.3.2.8
D      ZPR_3810 UDP-N-acetylmuramoylalanine--D-glutamate ligase K01925 murD; UDP-N-acetylmuramoylalanine--D-glutamate ligase EC:6.3.2.9
D      ZPR_3812 UDP-N-acetylmuramoylalanyl-D-glutamate--2 K01928 murE; UDP-N-acetylmuramoyl-L-alanyl-D-glutamate--2,6-diaminopimelate ligase EC:6.3.2.13
D      ZPR_0820 D-alanyl-alanine synthetase A K01921 ddl; D-alanine-D-alanine ligase EC:6.3.2.4
D      ZPR_3928 UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase K01929 murF; UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase EC:6.3.2.10
D      ZPR_4441 putative undecaprenol kinase K06153 bacA; undecaprenyl-diphosphatase EC:3.6.1.27
D      ZPR_3043 PAP2 superfamily membrane protein K19302 bcrC; undecaprenyl-diphosphatase EC:3.6.1.27

Expected output:

D      ZPR_3743 UDP-N-acetylenolpyruvoylglucosamine reductase K00075 murB; UDP-N-acetylmuramate dehydrogenase EC:1.3.1.98
D      ZPR_2530 UDP-N-acetylglucosamine 1-carboxyvinyltransferase K00790 murA; UDP-N-acetylglucosamine 1-carboxyvinyltransferase EC:2.5.1.7
D      ZPR_3807 UDP-N-acetylmuramate--L-alanine ligase K01924 murC; UDP-N-acetylmuramate--alanine ligase EC:6.3.2.8

Command that I used:

awk 'NR==FNR{a[$1]=$NF;next;} {print ($0 ? a[$1] OFS $0 :$0)}' no-tab-file.txt Non-homo-Dzpr00001.txt

Also:

grep -Ff Non-homo-Dzpr00001.txt no-tab-file.txt

Upvotes: 0

Views: 131

Answers (3)

RavinderSingh13

Reputation: 133428

EDIT: After working with the OP, the following is the command that worked for her.

awk 'FNR==NR{a[$0];next} {for(i in a){if(match($0,i)){print;next}}}' file1 file2
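
One caveat: `match()` interprets each file1 line as a regular expression, so characters such as `+` or `(` in a line could change what it matches. For comparison, here is a minimal Python sketch of the same idea using literal substring tests instead. The function name `lines_containing` and the abbreviated in-memory sample lists are hypothetical, chosen only for illustration, and stripping whitespace from the ends of the file1 lines is an assumption about what should count as a match.

```python
def lines_containing(needles, haystack_lines):
    """Return the haystack lines that contain any needle as a literal substring."""
    # Strip surrounding whitespace and skip blank needles (assumption:
    # whitespace at the ends of a file1 line is not significant).
    cleaned = [n.strip() for n in needles if n.strip()]
    return [line for line in haystack_lines
            if any(n in line for n in cleaned)]


# Abbreviated sample data in the shape of file1 and file2 from the question.
file1 = [
    " UDP-N-acetylenolpyruvoylglucosamine reductase K00075 murB",
    " UDP-N-acetylmuramate--L-alanine ligase K01924 murC",
]
file2 = [
    "D      ZPR_3743 UDP-N-acetylenolpyruvoylglucosamine reductase K00075 murB",
    "D      ZPR_3807 UDP-N-acetylmuramate--L-alanine ligase K01924 murC",
    "D      ZPR_0820 D-alanyl-alanine synthetase A K01921 ddl",
]

for line in lines_containing(file1, file2):
    print(line)
```

Like the awk one-liner, this keeps the matching lines in file2 order and checks every file1 line against every file2 line, so it is only meant for small inputs.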


Could you please try the following, written and tested with your shown samples only. It assumes that the lines of your Input_file1 start with a space (as in the shown samples).

awk '
FNR==NR{
  a[$0]
  next
}
{
  val=$0
  sub(/ +/,"",val)
  sub(/[^ ]*/,"",val)
}
val in a
'  file1 file2

Explanation: Adding a detailed explanation for the above code here.

awk '                       ##Start of the awk program.
FNR==NR{                    ##FNR==NR is TRUE only while the first Input_file (file1) is being read.
  a[$0]                     ##Creating an array named a whose index is $0 (the current line).
  next                      ##next skips all further statements and moves on to the next line.
}                           ##Closing BLOCK for the FNR==NR condition here.
{
  val=$0                    ##Creating a variable named val whose value is $0 (the current file2 line).
  sub(/ +/,"",val)          ##Removing the first run of spaces in val (the gap after the leading D).
  sub(/[^ ]*/,"",val)       ##Removing the leading run of non-space characters (D plus the ZPR id), keeping the space that follows.
}
val in a                    ##If val is present as an index in array a, print the current line.
'  file1 file2              ##Mentioning Input_file names here.

Output will be as follows.

D      ZPR_2530 UDP-N-acetylglucosamine 1-carboxyvinyltransferase K00790 murA; UDP-N-acetylglucosamine 1-carboxyvinyltransferase EC:2.5.1.7
D      ZPR_3743 UDP-N-acetylenolpyruvoylglucosamine reductase K00075 murB; UDP-N-acetylmuramate dehydrogenase EC:1.3.1.98
D      ZPR_3807 UDP-N-acetylmuramate--L-alanine ligase K01924 murC; UDP-N-acetylmuramate--alanine ligase EC:6.3.2.8

Upvotes: 1

Gopika BG

Reputation: 817

Please try the following:

grep -Fxf file1 file2 > file3.txt
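
Note that `-x` restricts grep to whole-line matches. In the samples as shown, every file2 line starts with a `D` column and a ZPR id that the file1 lines lack, so a whole-line comparison would presumably select nothing unless those columns are removed first. The Python sketch below only illustrates that difference; the helper names `whole_line_matches` and `strip_two_columns` and the abbreviated sample lines are hypothetical.

```python
def whole_line_matches(patterns, lines):
    """Mimic whole-line fixed-string matching: keep lines equal to some pattern."""
    wanted = set(patterns)
    return [line for line in lines if line in wanted]


def strip_two_columns(line):
    """Drop the first two whitespace-separated columns (here: 'D' and the ZPR id)."""
    parts = line.split(maxsplit=2)
    return parts[2] if len(parts) == 3 else ""


# Abbreviated sample data in the shape of file1 and file2 from the question.
file1 = ["UDP-N-acetylmuramate--L-alanine ligase K01924 murC"]
file2 = [
    "D      ZPR_3807 UDP-N-acetylmuramate--L-alanine ligase K01924 murC",
    "D      ZPR_0820 D-alanyl-alanine synthetase A K01921 ddl",
]

# Comparing the raw lines finds nothing because of the leading columns.
print(whole_line_matches(file1, file2))

# Comparing after removing the two leading columns finds the murC line.
wanted = set(file1)
print([line for line in file2 if strip_two_columns(line) in wanted])
```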

Upvotes: 0

Filip Młynarski

Reputation: 3612

Here's an example solution in Python, without shell tools:

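# file1 lines to look for, with surrounding whitespace removed
with open('Non-homo-Dzpr00001.txt') as f:
    f_1 = [line.strip() for line in f if line.strip()]

# Map each file2 line, minus its first two columns, to the full line
with open('no-tab-file.txt') as f:
    f_2 = {line.split(maxsplit=2)[-1].strip(): line.rstrip('\n') for line in f if line.strip()}

# Keep the file2 lines whose remainder appears in file1, in file1 order
f_3_content = '\n'.join(f_2[i] for i in f_1 if i in f_2)

with open('file_3', 'w') as f:
    f.write(f_3_content)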

Upvotes: 0
