Reputation: 27
I have a file with multiple columns (greater than 1000). Each column has numbers 0, 1 or some other. The tab delimited file looks like :
0 0 0
0 0 0
1 2 0
1 0 0
1 1 0
I want to calculate the occurrence of each unique digit for each column in the file. How do I do that using AWK or shell ?
P.S To calculate the occurrence of each unique digit in first column, i used AWK code :
awk '{h[$1]++}; END {for (k in h) print k, h[k]}' file > output-file
It gives the results as :
0 2
1 3
which means 0
occurs twice in column 1 and 1
occurs thrice in column 1.
I want to do the same for a file having over 1000 columns.
Upvotes: 1
Views: 1428
Reputation: 827
With awk 4.0 2D arrays
sample input matrix of n=3 columns containing integer values
0 0 0
0 0 0
1 2 0
1 0 0
1 1 0
4 0 0
7 -1 -2
output is vector of data values in column 0 that occur in input followed by matrix of n=3 columns with count of each data value in corresponding column of input matrix
-1 0 1 0
-2 0 0 1
0 2 4 6
1 3 1 0
2 0 1 0
4 1 0 0
7 1 0 0
code
awk '
NR==1 {ncols=NF}
{for(i=1; i <=NF; ++i) ++c[$i][i-1]}
END{
for(i in c) {
printf("%d ", i)
for(j=0; j < ncols; ++j) {
printf("%d ", j in c[i]?c[i][j]: 0)
}
printf("\n")
}
}
'
Upvotes: 0
Reputation: 113814
You just need to make the keys for associative array h
contain both column number, i
, and column value, $i
:
$ awk '{for (i=1;i<=NF;i++) h[i" "$i]++}; END {for (k in h) print k, h[k]}' file | sort -n
1 0 2
1 1 3
2 0 3
2 1 1
2 2 1
3 0 5
The last line above indicates that column 3 has the value 0 occurring 5 times.
In more detail:
for (i=1;i<=NF;i++) h[i" "$i]++
This loops over all columns from the first, i-=1
, to the last, i=NF
. For each column, it updates the counter h
for that column and its value.
END {for (k in h) print k, h[k]}
This prints a table of the output.
sort -n
Because for (k in h)
does not produce keys in any particular order, we put the output through sort.
Upvotes: 2