tiggersandtigers
tiggersandtigers

Reputation: 11

Need to combine columns from 2 files, 1 of which has multiple entries per line in the final file

I have two data files, each with 2 columns, and I need to combine them such that the first two columns are retained from the first file, and the first column of the second file is appended when the second column matches the first column of the first file. However, there may be multiple entries that match each line, and they all need to be added to the same line, separated by spaces.

File 1:

GO:0000001  mitochondrion|inheritance
GO:0000002  mitochondrial|genome|maintenance
GO:0000003  reproduction
GO:0000005  obsolete|ribosomal|chaperone|activity

File 2:

TMALIG_00480     GO:0000003
TMALIG_00489     GO:0000002
TMALIG_00499     GO:0000001
TMALIG_00499     GO:0000005
TMALIG_00500     GO:0000002

Desired output:

GO:0000001  mitochondrion|inheritance   TMALIG_00499
GO:0000002  mitochondrial|genome|maintenance    TMALIG_00489 TMALIG_00500
GO:0000003  reproduction    TMALIG_00480
GO:0000005  obsolete|ribosomal|chaperone|activity   TMALIG_00499

I have tried the following code, which only returns the values from File 1, without adding any from File 2:

awk -F '\t' 'FR==FNR { a[$2]=1; next} {print $1, $2, a[$1]}' go.db.txt TMALIG_GOids.txt

GO:0000001  mitochondrion|inheritance
GO:0000002  mitochondrial|genome|maintenance
GO:0000003  reproduction
GO:0000005  obsolete|ribosomal|chaperone|activity

Upvotes: 1

Views: 61

Answers (1)

John1024
John1024

Reputation: 113824

You were close:

$ awk -F '\t' 'NR==FNR { a[$2]=($2 in a?a[$2]" ":"")$1; next} {print $1, $2, a[$1]}' OFS='\t' file2 file1
GO:0000001      mitochondrion|inheritance       TMALIG_00499
GO:0000002      mitochondrial|genome|maintenance        TMALIG_00489 TMALIG_00500
GO:0000003      reproduction    TMALIG_00480
GO:0000005      obsolete|ribosomal|chaperone|activity   TMALIG_00499

How it works

  • -F '\t'

    This sets the field separator on input to a tab.

  • NR==FNR { a[$2]=($2 in a?a[$2]"\t":"")$1; next}

    This captures the information in file2 into associative array a.

    The tricky part here is that, if a[$1] already has an entry, we append a space and field 1, $1, after that existing entry. Otherwise, a[$2] is just set to $1. This is accomplished with the aid of a ternary statement.

  • print $1, $2, a[$1]

    This prints out the new line with the values from file2 appended at the end of the line.

  • OFS='\t'

    This tells awk to use a tab for the output field separator.

Upvotes: 1

Related Questions