Merge lines based on first column without delimiter

Question

I need to merge all the lines that have the same value in the first column.

The input file is the following:

34600000031|(1|1|0|1|1|20190114180000|20191027185959)
34600000031|(2|2|0|2|2|20190114180000|20191027185959)
34600000031|(3|3|0|3|3|20190114180000|20191027185959)
34600000031|(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)
34600000015|(2|2|100|2|9|20190114180000|20191027185959)
34600000015|(3|3|100|3|10|20190114180000|20191027185959)
34600000015|(4|4|100|4|11|20190114180000|20191027185959)

I was able to partially achieve it using the following:

awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p,x); s=s $0} END{print s}' INPUT

The output is the following:

34600000031|(1|1|0|1|1|20190114180000|20191027185959)|(2|2|0|2|2|20190114180000|20191027185959)|(3|3|0|3|3|20190114180000|20191027185959)|(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)|(2|2|100|2|9|20190114180000|20191027185959)|(3|3|100|3|10|20190114180000|20191027185959)|(4|4|100|4|11|20190114180000|20191027185959)

What I need (and cannot figure out how to get) is the following:

34600000031|(1|1|0|1|1|20190114180000|20191027185959)(2|2|0|2|2|20190114180000|20191027185959)(3|3|0|3|3|20190114180000|20191027185959)(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)(2|2|100|2|9|20190114180000|20191027185959)(3|3|100|3|10|20190114180000|20191027185959)(4|4|100|4|11|20190114180000|20191027185959)

I could do a sed after the initial awk but I don't believe that this is the proper way to do it.

KamilCuk · Accepted Answer

You need to substitute the separator as well, not just the key. Your fixed awk would look like this:

awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p "\\|",x); s=s $0} END{print s}'

but it's also good to anchor the match at the beginning of the string:

awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub("^" p "\\|",x); s=s $0} END{print s}'
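
If you would rather not build a regex from the key at all, the key can be stripped by length instead of by pattern, which sidesteps the escaping of | and of any regex metacharacters that might appear in the first field. A minimal sketch, assuming lines with the same first field are adjacent, as in the question; the function name merge_sorted is only for illustration:

```shell
# merge_sorted: hypothetical helper name, not part of the original answer.
# Reads "|"-separated lines on stdin; assumes equal keys are adjacent.
merge_sorted() {
  awk -F'|' '
    # First line or a new key: flush the previous group, start a new one.
    NR == 1 || $1 != p { if (NR > 1) print s; p = $1; s = $0; next }
    # Same key: append everything after "key|" (key length + separator + 1).
    { s = s substr($0, length($1) + 2) }
    # Flush the last group, unless the input was empty.
    END { if (NR) print s }
  '
}
```

It would be called as merge_sorted < INPUT.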

I would do it in a somewhat simpler way, which uses more memory (as it stores everything in an array) but doesn't need the file to be sorted:

awk -F'|' '{ k=$1; sub("^" $1 "\\|", ""); a[k] = a[k] $0 } END{ for (i in a) print i "|" a[i] }'

For each line, remember the first field, replace the first field plus the following | with nothing, then append what is left to the array element indexed by the first field. At the end, print each element in the array as key, separator and value.
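
One thing to keep in mind with the array approach: awk does not specify the order in which for (i in a) visits the keys, so the merged lines may come out in a different order than the keys appeared in the input. If that matters, the keys can be recorded the first time they are seen. A sketch along those lines; the names merge_unsorted, order and n are mine, not from the answer above:

```shell
# merge_unsorted: hypothetical helper name, not part of the original answer.
# Reads "|"-separated lines on stdin; keys may appear in any order.
merge_unsorted() {
  awk -F'|' '
    # Remember each key the first time it appears.
    !($1 in a) { order[++n] = $1 }
    # Append everything after "key|" to the value stored for that key.
    { a[$1] = a[$1] substr($0, length($1) + 2) }
    # Print the groups in first-seen key order.
    END { for (i = 1; i <= n; i++) print order[i] "|" a[order[i]] }
  '
}
```

It would be called as merge_unsorted < INPUT. Like the one-liner above, it holds the whole input in memory until the end.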
