Pavlos Maragkos

Reputation: 91

Merge lines based on first column without delimiter

I need to merge all the lines that have the same value in the first column.

The input file is the following:

34600000031|(1|1|0|1|1|20190114180000|20191027185959)
34600000031|(2|2|0|2|2|20190114180000|20191027185959)
34600000031|(3|3|0|3|3|20190114180000|20191027185959)
34600000031|(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)
34600000015|(2|2|100|2|9|20190114180000|20191027185959)
34600000015|(3|3|100|3|10|20190114180000|20191027185959)
34600000015|(4|4|100|4|11|20190114180000|20191027185959)

I was able to partially achieve it using the following:

awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p,x); s=s $0} END{print s}' INPUT

The output is the following:

34600000031|(1|1|0|1|1|20190114180000|20191027185959)|(2|2|0|2|2|20190114180000|20191027185959)|(3|3|0|3|3|20190114180000|20191027185959)|(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)|(2|2|100|2|9|20190114180000|20191027185959)|(3|3|100|3|10|20190114180000|20191027185959)|(4|4|100|4|11|20190114180000|20191027185959)

What I need (and I cannot find out how to get) is the following:

34600000031|(1|1|0|1|1|20190114180000|20191027185959)(2|2|0|2|2|20190114180000|20191027185959)(3|3|0|3|3|20190114180000|20191027185959)(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)(2|2|100|2|9|20190114180000|20191027185959)(3|3|100|3|10|20190114180000|20191027185959)(4|4|100|4|11|20190114180000|20191027185959)

I could run sed after the initial awk, but I don't believe that is the proper way to do it.

Upvotes: 0

Views: 65

Answers (2)

Ed Morton

Reputation: 203597

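$ awk -F'|' '
    {
        curr = $1                      # key: the first field
        sub(/^[^|]+\|/,"")             # strip the key and its trailing "|" from $0
        # start a new output line only when the key changes
        printf "%s%s", (curr==prev ? "" : ors curr FS), $0
        ors = ORS                      # empty before the first line, ORS afterwards
        prev = curr
    }
    END { print "" }                   # terminate the last output line
' file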
34600000031|(1|1|0|1|1|20190114180000|20191027185959)(2|2|0|2|2|20190114180000|20191027185959)(3|3|0|3|3|20190114180000|20191027185959)(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)(2|2|100|2|9|20190114180000|20191027185959)(3|3|100|3|10|20190114180000|20191027185959)(4|4|100|4|11|20190114180000|20191027185959)

Upvotes: 0

KamilCuk

Reputation: 141040

You need to remove the separator that follows the first field too, not just the field itself. Your fixed awk would look like this:

awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p "\\|",x); s=s $0} END{print s}'

but it's also good to anchor the match to the beginning of the string:

awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub("^" p "\\|",x); s=s $0} END{print s}'

I would do it somewhat more simply. This version uses more memory (it stores everything in an array) but doesn't need the file to be sorted:

awk -F'|' '{ k=$1; sub("^" $1 "\\|", ""); a[k] = a[k] $0 } END{ for (i in a) print i "|" a[i] }'

For each line, remember the first field, replace the first field together with the following | with nothing, then append what is left of the line to the array element indexed by the first field. At the end, print each key, the separator, and the accumulated value.
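One caveat with the array version: the order in which `for (i in a)` visits the keys is unspecified in awk, so the output lines may not come out in input order. The following is a minimal sketch of a variant that keeps first-seen order and cuts the key off by position instead of building a regex from it. The function name `merge_by_key` and the array names `merged` and `order` are my own, chosen for illustration only.

```shell
#!/bin/sh
# Sketch only: merge_by_key is a hypothetical helper name, not part of any tool.
merge_by_key() {
    awk -F'|' '
    {
        key = $1
        # Take everything after the first "|" by position, so the key is
        # never interpreted as a regular expression.
        rest = substr($0, length(key) + 2)
        if (!(key in merged)) {
            order[++n] = key        # remember the first-seen order of keys
        }
        merged[key] = merged[key] rest
    }
    END {
        # Print keys in the order they first appeared in the input.
        for (i = 1; i <= n; i++) {
            print order[i] FS merged[order[i]]
        }
    }' "$@"
}
```

It can be called as `merge_by_key INPUT` or at the end of a pipeline. Like the array version above, it holds all merged values in memory until the end of the input.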

Upvotes: 1
