Reputation: 5313
I have a CSV file from which I would like to extract some pieces of information: for each distinct value in one column, I would like to compute the sum of the corresponding values in another column. Eventually, I may do it in Python, but I believe there could be a simple solution using awk.
This could be the CSV file:
2 1:2010-1-bla:bla 1.6
2 2:2010-1-bla:bla 1.1
2 2:2010-1-bla:bla 3.4
2 3:2010-1-bla:bla -1.3
2 3:2010-1-bla:bla 6.0
2 3:2010-1-bla:bla 1.1
2 4:2010-1-bla:bla -1.0
2 5:2010-1-bla:bla 10.9
I would like to get:
1 1.6
2 4.5
3 5.8
4 -1.0
5 10.9
For now, I can only extract:
a) the group identifiers (the part of the second column before the first colon):
awk -F ' ' '{print $(2)}' MyFile.csv | awk -F ':' '{print $(1)}'
and then get:
1
2
2
3
3
3
4
5
b) and the values in the last column equal to, say, 1.1, with:
awk -F ' ' '{print $(NF)}' MyFile.csv | awk '$1 == 1.1'
and then get:
1.1
1.1
I am not able to extract both columns of interest at the same time, which might help me in the end. Here is a sample intermediate output which might make computing the sums easier (I am not sure):
1 1.6
2 1.1
2 3.4
3 -1.3
3 6.0
3 1.1
4 -1.0
5 10.9
Edit: Thanks to Elenaher, we could say the input is the file above.
Upvotes: 15
Views: 7670
Reputation: 3451
If Perl is an option:
perl -F'(\s+|:)' -lane '$h{$F[2]} += $F[-1]; END{print "$_ $h{$_}" for sort keys %h}' file
output:
1 1.6
2 4.5
3 5.8
4 -1
5 10.9
These command-line options are used:

-n : loop around every line of the input file
-l : removes newlines before processing, and adds them back in afterwards
-a : autosplit mode – split input lines into the @F array; defaults to splitting on whitespace
-e : execute the perl code
-F : autosplit modifier; in this case it splits on a colon or one-or-more whitespace characters (because the pattern contains a capturing group, the matched separators also land in @F, which is why the group id ends up in $F[2])

@F is the array of words in each line, indexed starting with $F[0]
$F[-1] is the last word

The result is stored in the hash %h. At the END, iterate through the sorted keys of the hash and print each key $_ and its value $h{$_}.
Upvotes: 1
Reputation: 4085
For your last question, you can use split and display the two columns simultaneously:
cat filename | awk '{split($2,tab,":"); id = tab[1]; print id " -> " $3;}'
That prints:
1 -> 1.6
2 -> 1.1
2 -> 3.4
3 -> -1.3
3 -> 6.0
3 -> 1.1
4 -> -1.0
5 -> 10.9
For the complete result you can use:
awk -F, '{ split($1,line," "); split(line[2],tab,":"); id=tab[1]; if (sums[id]=="") {sums[id] = 0;} sums[id]+=line[3];} END {for (i=1;i<=length(sums);i++) print i " -> "sums[i]}' < test
That prints:
1 -> 1.6
2 -> 4.5
3 -> 5.8
4 -> -1
5 -> 10.9
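For readability, the same one-liner can be spread out as a commented script (a sketch; it should behave identically to the command above):
# Same logic as the one-liner above, written out as a script.
# Run as: awk -F, -f sums.awk test    ("sums.awk" is just an example file name)
{
    split($1, line, " ")           # with -F, and no commas in the data, $1 is the whole line
    split(line[2], tab, ":")       # the second space-separated field looks like "1:2010-1-bla:bla"
    id = tab[1]                    # the part before the first colon is the group id
    if (sums[id] == "")            # initialise the sum the first time an id is seen
        sums[id] = 0
    sums[id] += line[3]            # add the third space-separated field (the value)
}
END {
    for (i = 1; i <= length(sums); i++)    # length() on an array is a gawk extension
        print i " -> " sums[i]
}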
Upvotes: 1
Reputation: 11366
{
    b = $2;                 # assign column 2 to the variable 'b'
    sub(/:.*/, "", b);      # get rid of everything after the first colon in b
    results[b] += $3
}
END { for (result in results) print result " " results[result] }
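To run it, the program can be saved to a file and pointed at the original data; the default whitespace splitting already puts the key field in $2 and the value in $3. A minimal invocation, assuming the script is saved as group_sums.awk (an example name), would be:
awk -f group_sums.awk MyFile.csv    # group_sums.awk = the script above, under any name you like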
Upvotes: 0
Reputation: 342383
$ awk -F"[: \t]+" '{a[$2]+=$NF}END{for(i in a ) print i,a[i] }' file
4 -1
5 10.9
1 1.6
2 4.5
3 5.8
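Because for (i in a) visits the keys in no particular order, the lines come out unsorted; if ordered output is needed, one option (not part of the original answer) is to pipe the result through sort:
awk -F"[: \t]+" '{a[$2]+=$NF}END{for(i in a ) print i,a[i] }' file | sort -n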
Upvotes: 12
Reputation: 154494
So, assuming that your input looks like this:
unique_col, to_sum
1.3, 1 2 3
1.3, 5 6 7
1.4, 2 3 4
Then this should do the trick:
$ awk -F, '{ if (seen[$1] == "") { split($2, to_sum, " "); seen[$1] = 0; for (x in to_sum) seen[$1] += to_sum[x]; }} END { for (x in seen) { if (x != "") { print x " " seen[x]; }}}' < input
1.3 6
1.4 9
Upvotes: 2
Reputation: 118661
This assumes you already have the two columns you showed before (key and value, e.g. 1 1.1). Note that it relies on lines with the same key appearing consecutively:
BEGIN {
    last = "";                    # key of the group currently being summed
    sum = 0;
}
{
    if ($1 != last) {             # the key changed: the previous group is finished
        if (last != "") {
            print last " " sum;   # emit the finished group's total
        }
        sum = 0;
        last = $1;
    }
    sum = sum + $2                # accumulate the value column
}
END {
    print last " " sum;           # emit the last group
}
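A sketch of how this could be combined with the extraction step from the question, assuming the script is saved as sum_groups.awk (an example name) and that, as in the sample data, lines with the same key are adjacent:
awk '{split($2, t, ":"); print t[1], $3}' MyFile.csv | awk -f sum_groups.awk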
Upvotes: 4