Reputation: 1970
Using the following data:
$cat p1.csv
R,3
R,4
S,1
S,2
S,3
R,2
T,4
R,3
ST,4
RST,2
RSTR,4
First, sort based on column 2:
$cat p1.csv | sort -t "," -k2
S,1
R,2
RST,2
S,2
R,3
R,3
S,3
R,4
RSTR,4
ST,4
T,4
I want to count the number of 1's, 2's, 3's, and 4's in column two. Something like:
$cat p1.csv | sort -t "," -k2 | uniq -f2 -c
Can uniq be applied to a single column? The -f2 option doesn't apply uniq to the correct field. The output should take the first instance of each unique value in column two and count the number of occurrences, which is why the data must first be sorted on column two. A correct output would look like:
1 S,1
3 R,2
3 R,3
4 R,4
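For what it's worth, I can get the bare counts by throwing away column one first (a sketch; it loses the first-instance line I want to keep):
$cat p1.csv | cut -d, -f2 | sort | uniq -c
1 1
3 2
3 3
4 4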
Suggestions?
Upvotes: 2
Views: 17663
Reputation: 1970
I just ran into an interesting case of repeated values in the data set that was easy to resolve, and I wanted to share it for anyone who might be interested.
I'm adding two new rows (marked with *) to the dataset:
$cat p1.space
R 3
R 4
S 1
S 2
S 3
R 2
T 4
R 3
ST 4
RST 2
RSTR 4
R* 5
R* 5
The addition introduces a repeated entry, R* 5. Sorting on column two, we see the new rows at the bottom:
$sort -k2 p1.space
S 1
R 2
RST 2
S 2
R 3
R 3
S 3
R 4
RSTR 4
ST 4
T 4
R* 5
R* 5
Notice how the repeated item (R* 5) is counted as two entries:
$sort -k2 p1.space | uniq -f1 -c
1 S 1
3 R 2
3 R 3
4 R 4
2 R* 5
To count each unique row only once, we have to run the dataset through uniq to remove the duplicates before counting:
$sort -k2 p1.space | uniq | uniq -f1 -c
1 S 1
3 R 2
2 R 3
4 R 4
1 R* 5
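An equivalent single-pass alternative (a sketch using awk) first skips repeated full lines with !seen[$0]++ and then counts the remaining lines per column-2 value; it produces the same output as above:
$sort -k2 p1.space | awk '!seen[$0]++ { n[$2]++; if (!($2 in first)) first[$2]=$0 } END { for (k in first) print n[k], first[k] }' | sort -k3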
I hope that helps anyone who might have a similar counting requirement. Good luck! And thank you @Jaypal.
Upvotes: 1
Reputation: 77115
Your question isn't quite clear, so I just reverse-engineered your output from your input (assuming there is a typo in your output, since you mention counting the number of 1's, 2's, and 3's from column 2 but show 2 R,2). You'll probably need to explain your question a little better -
sort -t "," -k2 < p1.csv |
awk -F, '!z[$2]++{ a[$2]=$0; } END {for (i in a) print z[i], a[i]}' |
sort -k1
- !z[$2]++ removes duplicates based on column 2 as awk progresses through each line.
- a[$2]=$0 stores the first line seen for each unique column-2 value in an array.
- END {..} loops over every key in the arrays and prints the values: from array a, the first line seen with each unique column 2 (your desired output); from array z, the number of lines seen with that column 2.
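For readability, here is the same awk program written out with comments (a sketch; the behavior is identical):
sort -t "," -k2 < p1.csv |
awk -F, '
    !z[$2]++ {           # true only the first time a column-2 value is seen;
        a[$2] = $0       # remember that first line (z[$2] keeps counting on every line)
    }
    END {
        for (i in a)     # one entry per unique column-2 value
            print z[i], a[i]
    }' |
sort -k1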
[jaypal:~/temp] cat file
R,3
R,4
S,1
S,2
S,3
R,2
T,4
R,3
ST,4
RST,2
RSTR,4
[jaypal:~/temp] sort -t "," -k2 < file |
awk -F, '!z[$2]++{ a[$2]=$0; } END {for (i in a) print z[i], a[i]}' |
sort -k1
1 S,1
3 R,2
3 R,3
4 R,4
For finding unique entries based on a column, you can try sort with the -u option (though it won't give you the counts).
From the man page:
-u, --unique
with -c, check for strict ordering;
without -c, output only the first of an equal run
You can try something like this -
sort -t, -k2 p1.csv | sort -u -t, -k2
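To get the counts back alongside that, one option is to paste it against a uniq -c pass over just the values; both streams come out ordered by the same key, so the rows line up (a sketch, assuming a shell with process substitution such as bash):
paste <(cut -d, -f2 p1.csv | sort | uniq -c | awk '{print $1}') \
      <(sort -t, -k2 p1.csv | sort -u -t, -k2)
Note that which line of an equal run survives -u is whichever sorts first, so the representative line may differ from the one uniq would keep.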
I am not sure uniq can be applied to a column separated by a delimiter other than a blank; at least on my Mac it can't. Here is the man page reference:
-f num Ignore the first num fields in each input line when doing comparisons.
A field is a string of non-blank characters separated
from adjacent fields by blanks. Field numbers are one based,
i.e., the first field is field one.
So if you replace the , delimiter with a blank and run the following command, you should get your desired result.
sort -k2 test | uniq -c -f1
[jaypal:~/temp] cat test
R 3
R 4
S 1
S 2
S 3
R 2
T 4
R 3
ST 4
RST 2
RSTR 4
[jaypal:~/temp] sort -k2 test | uniq -c -f1
1 S 1
3 R 2
3 R 3
4 R 4
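If creating a separate space-delimited file is inconvenient, the comma can also be translated on the fly with tr (a sketch; same idea without a temporary file):
[jaypal:~/temp] tr ',' ' ' < file | sort -k2 | uniq -c -f1
1 S 1
3 R 2
3 R 3
4 R 4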
Upvotes: 11