psaima
psaima

Reputation: 61

combining multiple column values in lines based on identical column

I need help to improve my working code for combining multiple rows based on identical values for specified columns. Here is a sample data:

c-i1_pos-at1-v2 162a    AT1G01040.1 2   3422-3443   3433    1
c-i1_pos-at1-v2 162b    AT1G01040.1 2   3422-3443   3433    1
pare-i_226-v2-wt    162a    AT1G01040.1 2   3422-3443   3433    0
pare-i_226-v2-wt    162b    AT1G01040.1 2   3422-3443   3433    0
xrn4-pare-i_ath-227-v2-wt   827 AT1G02860.1 1   258-278 269 1
i2_lib2-v2  156a    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156b    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156c    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156d    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156e    AT1G03730.1 4   242-260 252 3

Basically if values in columns $3,$5 are identical, I want to combine rows for columns $2,$6 (or more), with the unique values of the remaining columns merged like this:

AT1G01040.1 3422-3443   3433    162a,162b
AT1G02860.1 258-278 269 827
AT1G03730.1 242-260 252 156a,156b,156c,156d,156e

Right now I am trying to do this in multiple steps, based on the answers here.

awk 'BEGIN{FS=OFS="\t"} {c=$2 FS $3 FS $5; if (c in a) a[c]=a[c]","$6; else a[c]=$6}END{for (k in a) print k,a[k]}'|awk '{p=$1 FS $2 FS $4; if (p in l) l[p]=l[p]","$3;else l[p]=$3}END{for (m in l) print m,l[m]}' <input.txt

Which gives:

AT1G01040.1 3422-3443   3433,3433   162a,162b
AT1G02860.1 258-278 269 827
AT1G03730.1 242-260 252 156a,156b,156c,156d,156e

I thought I should put the values in the remaining columns as arrays to get my desired output at one step, but I am struggling to figure out the correct context.

Upvotes: 0

Views: 191

Answers (1)

nu11p01n73R
nu11p01n73R

Reputation: 26687

How about something like

awk '{if ( $3 in a ) a[$3] = a[$3]","$2; else a[$3] = $3" "$5" "$6" "$2} END{for (i in a) print a[i]}' inputFile

Will produce output as

AT1G03730.1 242-260 252 156a,156b,156c,156d,156e
AT1G02860.1 258-278 269 827
AT1G01040.1 3422-3443 3433 162a,162b,162a,162b

Explanation

a[$3] = $3" "$5" "$6" "$2 creates an array a indexed by the third field $3, the else part ensures that the array is created when the row is encounterd for the first time.

if ( $3 in a ) a[$3] = a[$3]","$2 if the third field $3 is already present in the array, append field two $2 to the array

END{for (i in a) print a[i]} END block is excecuted at the end of input. Prints the entire array giving the output

EDIT

A simpler version would be

awk '{( $3 in a ) ? a[$3] = a[$3]","$2 : a[$3] = $3" "$5" "$6" "$2} END{for (i in a) print a[i]}' inputFile

Thank you Jotne for the suggestion.

Upvotes: 2

Related Questions