Reputation: 11
I have two data files, each with 2 columns, and I need to combine them such that the first two columns are retained from the first file, and the first column of the second file is appended when the second column matches the first column of the first file. However, there may be multiple entries that match each line, and they all need to be added to the same line, separated by spaces.
File 1:
GO:0000001 mitochondrion|inheritance
GO:0000002 mitochondrial|genome|maintenance
GO:0000003 reproduction
GO:0000005 obsolete|ribosomal|chaperone|activity
File 2:
TMALIG_00480 GO:0000003
TMALIG_00489 GO:0000002
TMALIG_00499 GO:0000001
TMALIG_00499 GO:0000005
TMALIG_00500 GO:0000002
Desired output:
GO:0000001 mitochondrion|inheritance TMALIG_00499
GO:0000002 mitochondrial|genome|maintenance TMALIG_00489 TMALIG_00500
GO:0000003 reproduction TMALIG_00480
GO:0000005 obsolete|ribosomal|chaperone|activity TMALIG_00499
I have tried the following code, which only returns the values from File 1, without adding any from File 2:
awk -F '\t' 'FR==FNR { a[$2]=1; next} {print $1, $2, a[$1]}' go.db.txt TMALIG_GOids.txt
GO:0000001 mitochondrion|inheritance
GO:0000002 mitochondrial|genome|maintenance
GO:0000003 reproduction
GO:0000005 obsolete|ribosomal|chaperone|activity
Upvotes: 1
Views: 61
Reputation: 113824
You were close:
$ awk -F '\t' 'NR==FNR { a[$2]=($2 in a?a[$2]" ":"")$1; next} {print $1, $2, a[$1]}' OFS='\t' file2 file1
GO:0000001 mitochondrion|inheritance TMALIG_00499
GO:0000002 mitochondrial|genome|maintenance TMALIG_00489 TMALIG_00500
GO:0000003 reproduction TMALIG_00480
GO:0000005 obsolete|ribosomal|chaperone|activity TMALIG_00499
-F '\t'
This sets the field separator on input to a tab.
NR==FNR { a[$2]=($2 in a?a[$2]"\t":"")$1; next}
This captures the information in file2 into associative array a
.
The tricky part here is that, if a[$1]
already has an entry, we append a space and field 1, $1
, after that existing entry. Otherwise, a[$2]
is just set to $1
. This is accomplished with the aid of a ternary statement.
print $1, $2, a[$1]
This prints out the new line with the values from file2 appended at the end of the line.
OFS='\t'
This tells awk to use a tab for the output field separator.
Upvotes: 1