Reputation: 73

Merging file based on more than one column in bash

I have a file with multiple columns (separated by commas) where the values repeat. What I want to do is merge or "roll up" the rows based on these columns.

For example, say I have the following:

Input File:

ID, Name , Eye Color, Hair Color, Marital Status
1 , John , Brown    , Brown     , Single
1 , Mary , Green    , Brown     , Married
2 , Joe  , Blue     , Blonde    , Divorced
2 , Brian, Green    , Brown     , Single
2 , Gary , Brown    , Blonde    , Married

And I want the following output based on the first and fourth columns:

Output File:

ID, Name , Eye Color, Hair Color, Marital Status, Name, Eye Color, Hair Color, Marital Status
1 , John , Brown    , Brown     , Single        , Mary, Green    , Brown     , Married
2 , Joe  , Blue     , Blonde    , Divorced      , Gary, Brown    , Blonde    , Married
2 , Brian, Green    , Brown     , Single

I can do this for the first column with the following awk:

awk -F, '
    NR!=1 && p1!=$1 { print prev; prev="" }
    { p1=$1; prev=(prev"") ? prev FS substr($0,index($0,$2)) : $0 }
    END { if(prev"") print prev }
' input.txt > output.txt

I need to find a way to include the fourth column as well.

Upvotes: 0

Answers (2)

Michael Back

Reputation: 1871

The following uses three associative arrays to keep track of everything...

The x array is a two dimensional "cross reference" -- having indexes of column 1 and column 4 values and having a stored value being the row number where we will locate matching lines.
The g array keeps track of the number of times we "grow" a particular row. (We use this array to grow the header when needed.)
The o array is our output array that aggregates our lines of data.

We use sprintf to reformat the last column of input prior to appending to output to preserve the spacing of the table per the question's output requirement.

awk -F, -v OFS=, '
    ($1, $4) in x {
        # existent row -- append
        i = x[$1, $4]
        $1 = ""
        $5 = sprintf("%-15s", $5)
        o[i] = o[i] $0
        if(++g[i] > g[1]) {
            # grow the header row
            o[1] = o[1] h
            ++g[1]
        }
        next
    }
    {
        # new output row
        x[$1, $4] = ++n
        $5 = sprintf("%-15s", $5)
        o[n] = $0
    }
    NR==1 {
        # save header for append
        $1 = ""
        h = $0
    }
    END {
        for (i=1; i<=n; ++i)
            print o[i]
    }'

Output is what is specified in the question (including both order and formatting). One possible weakness in the above is that formatting (spacing) may not hold if someone has a Rather Complicated Marital Status. Also, if trailing white-space is unacceptable, it is easy to trim away in the END block's for loop via sub(/ +$/, "", o[i]).

Upvotes: 0

karakfa

Reputation: 67497

here is the general idea, doesn't assume records are ordered (but doesn't preserve the order either)

$ awk 'BEGIN{ FS=" *, *"; OFS=","} 
       NR==1{split($0,header);next} 
            {a[$1,$4]=(($1,$4) in a?a[$1,$4] OFS:"") $0} 
         END{for(k in a) print a[k]}' file

2 , Joe  , Blue     , Blonde    , Divorced,2 , Gary , Brown    , Blonde    , Married
2 , Brian, Green    , Brown     , Single
1 , John , Brown    , Brown     , Single,1 , Mary , Green    , Brown     , Married

you can filter the unwanted repeated columns as in your logic and need to work to fit the header to the max length of the matched records...

Formatted version with header extended can be

$ awk 'BEGIN{FS=" *, *"; OFS=","}
       NR==1{$1=$1; header0=$0; split($0,header); next} 
            {$1=$1; c[$1,$4]++; 
             a[$1,$4]=(($1,$4) in a?a[$1,$4] OFS $2 OFS $3 OFS $5:$0)}
         END{for(k in c) if(max<c[k]) max=c[k]; 
             printf "%s",header0; 
             for(i=2;i<=max;i++) printf "%s", OFS header[2] OFS header[3] OFS header[5]; 
             print ""; 
             for(k in a) print a[k] | "sort -n" }' file | 
  column -ts,


ID  Name   Eye Color  Hair Color  Marital Status  Name  Eye Color  Marital Status
1   John   Brown      Brown       Single          Mary  Green      Married
2   Brian  Green      Brown       Single
2   Joe    Blue       Blonde      Divorced        Gary  Brown      Married

Upvotes: 2

Merging file based on more than one column in bash

Answers (2)

Related Questions