Reputation: 455
I have been searching in the forum a solution for my problem but unfortunately I was not successful. I am quite green in bash; therefore I would really appreciate any suggestion ot link to previous question/discussions
I have two files:
file1
Ada ada
Ada ada
Ada aidB
Ada alkA
Ada alkB
Ada alkB
AdiY adiA
AdiY gadA
file2
NP_414557.1 insL1
NP_414559.1 mokC
YP_025292.1 hokC
NP_414560.1 nhaA
NP_414561.1 nhaR
NP_414562.1 insB1
NP_414563.1 insA
NP_414564.1 rpsT
I would like to compare file1 and file2 based on the second column. If the object is the same I would like to append the text in the column 1 in file2 relative to that specific shared entry in a third column in file1.
Expected output: file1
PhoB caiT NP_414580.1
PhoP caiE NP_414581.1
PhoP caiB NP_414582.1
Please consider the text reported just as example.
I am struggling with awk trying to adjust the code from a previouse question:
Compare two files of different columns and print different columns
awk 'NR==FNR{a[$1]=$2 ; next }$2 in a {$1=$1 FS a[$2]print}' file1 file2
Thank you very much in advance for your time and your help.
Best.
Upvotes: 3
Views: 5520
Reputation: 11786
As @fedorqui commented, your example inputs/output are not consistent. I think this should do the trick though:
awk 'NR==FNR{a[$2]=$0; next} a[$2]>0{print a[$2],$1}' file1 file2
file1:
A alice
B bob
C carol
D dan
file2:
1 dan
2 alice
3 carol
4 bob
Output:
$ awk 'NR==FNR{a[$2]=$0} NR>FNR && a[$2]>0{print a[$2],$1}' file1 file2
D dan 1
A alice 2
C carol 3
B bob 4
Output can be sorted by whatever column you choose using sort -k
. Breaking down the awk
code:
NR==FNR{a[$2]=$0; next}
- NR
is an awk
variable which contains the total number of lines processed so far. FNR
is similar, but only contains the number of processes lines of the current file, so this condition effectively means "only do this for the first input file". The associated action stores the entire line ($0
) in the associative array a
, with the index being the value of the second field of the row. next
just means that awk
should move to the next line without doing further processing.
a[$2]>0{print a[$2],$1}
- this condition will only be tested for the second and subsequent input files. The second field is used as an index to look up a value from the a
- if the value is greater than 0 (ie, the value is a string in this case) then the value is printed, followed by the first field of the current line.
Basically, every line of the first file is stored in an array, indexed by the second field of the line. If that second field is matched in the second file, then the whole line from the first file is printed, with the second field from the second file appended.
Upvotes: 3
Reputation: 72
You can use join command to link fields on files. The simplest way to use it can be:
join -j 2
to obtain an output with the first and second file of first file and following the first field of second file you can use the -o option as follow:
join -j 2 -o 1.1 1.2 2.1
it is required that the lines in both files are sorted on second field, if the lines are not correctly sorted you can use the following bash trick:
join -j 2 -o 1.1 1.2 2.1 <(sort -k2 1) <(sort -k2 2)
Supposing this is the content of two files:
$ cat 1
FIRST first
SECOND second
THIRD third
FOURTH fourth
$ cat 2
2 second
1 first
3 third
4 fourth
Note: the field separator on each line is the TAB character
The final result:
$ join -j 2 -o 1.1 1.2 2.1 <(sort 1) <(sort -k2 2)
FIRST first 1
FOURTH fourth 4
SECOND second 2
THIRD third 3
Upvotes: 1