Reputation: 73
I have a file with multiple columns (separated by commas) where the values repeat. What I want to do is merge or "roll up" the rows based on these columns.
For example, say I have the following:
Input File:
ID, Name , Eye Color, Hair Color, Marital Status
1 , John , Brown , Brown , Single
1 , Mary , Green , Brown , Married
2 , Joe , Blue , Blonde , Divorced
2 , Brian, Green , Brown , Single
2 , Gary , Brown , Blonde , Married
And I want the following output based on the first and fourth columns:
Output File:
ID, Name , Eye Color, Hair Color, Marital Status, Name, Eye Color, Hair Color, Marital Status
1 , John , Brown , Brown , Single , Mary, Green , Brown , Married
2 , Joe , Blue , Blonde , Divorced , Gary, Brown , Blonde , Married
2 , Brian, Green , Brown , Single
I can do this for the first column with the following awk:
awk -F, '
NR!=1 && p1!=$1 { print prev; prev="" }
{ p1=$1; prev=(prev"") ? prev FS substr($0,index($0,$2)) : $0 }
END { if(prev"") print prev }
' input.txt > output.txt
I need to find a way to include the fourth column as well.
Upvotes: 0
Views: 99
Reputation: 1871
The following uses three associative arrays to keep track of everything...
x
array is a two dimensional "cross reference" -- having indexes of column 1 and column 4 values and having a stored value being the row number where we will locate matching lines.g
array keeps track of the number of times we "grow" a particular row. (We use this array to grow the header when needed.)o
array is our output array that aggregates our lines of data.We use sprintf
to reformat the last column of input prior to appending to output to preserve the spacing of the table per the question's output requirement.
awk -F, -v OFS=, '
($1, $4) in x {
# existent row -- append
i = x[$1, $4]
$1 = ""
$5 = sprintf("%-15s", $5)
o[i] = o[i] $0
if(++g[i] > g[1]) {
# grow the header row
o[1] = o[1] h
++g[1]
}
next
}
{
# new output row
x[$1, $4] = ++n
$5 = sprintf("%-15s", $5)
o[n] = $0
}
NR==1 {
# save header for append
$1 = ""
h = $0
}
END {
for (i=1; i<=n; ++i)
print o[i]
}'
Output is what is specified in the question (including both order and formatting). One possible weakness in the above is that formatting (spacing) may not hold if someone has a Rather Complicated Marital Status. Also, if trailing white-space is unacceptable, it is easy to trim away in the END
block's for
loop via sub(/ +$/, "", o[i])
.
Upvotes: 0
Reputation: 67497
here is the general idea, doesn't assume records are ordered (but doesn't preserve the order either)
$ awk 'BEGIN{ FS=" *, *"; OFS=","}
NR==1{split($0,header);next}
{a[$1,$4]=(($1,$4) in a?a[$1,$4] OFS:"") $0}
END{for(k in a) print a[k]}' file
2 , Joe , Blue , Blonde , Divorced,2 , Gary , Brown , Blonde , Married
2 , Brian, Green , Brown , Single
1 , John , Brown , Brown , Single,1 , Mary , Green , Brown , Married
you can filter the unwanted repeated columns as in your logic and need to work to fit the header to the max length of the matched records...
Formatted version with header extended can be
$ awk 'BEGIN{FS=" *, *"; OFS=","}
NR==1{$1=$1; header0=$0; split($0,header); next}
{$1=$1; c[$1,$4]++;
a[$1,$4]=(($1,$4) in a?a[$1,$4] OFS $2 OFS $3 OFS $5:$0)}
END{for(k in c) if(max<c[k]) max=c[k];
printf "%s",header0;
for(i=2;i<=max;i++) printf "%s", OFS header[2] OFS header[3] OFS header[5];
print "";
for(k in a) print a[k] | "sort -n" }' file |
column -ts,
ID Name Eye Color Hair Color Marital Status Name Eye Color Marital Status
1 John Brown Brown Single Mary Green Married
2 Brian Green Brown Single
2 Joe Blue Blonde Divorced Gary Brown Married
Upvotes: 2