Reputation: 17
cat scorecard.csv|cut -d , -f6|sort -n|uniq -c
gives me a count for each value with no repeats, while
cat scorecard.csv|cut -d , -f6|uniq -c|sort -n
gives me counts, but values are repeated and the counts are not accurate. Why is this, when the two commands are so similar?
Here is some output when sorting first, then running uniq:
9 AK
94 AL
89 AR
1 AS
122 AZ
714 CA
113 CO
81 CT
24 DC
20 DE
409 FL
1 FM
174 GA
3 GU
24 HI
88 IA
36 ID
275 IL
151 IN
84 KS
100 KY
130 LA
178 MA
91 MD
40 ME
1 MH
194 MI
124 MN
179 MO
1 MP
61 MS
33 MT
187 NC
29 ND
49 NE
40 NH
160 NJ
48 NM
41 NV
449 NY
313 OH
127 OK
86 OR
377 PA
137 PR
1 PW
24 RI
108 SC
30 SD
1 STABBR
176 TN
443 TX
75 UT
177 VA
2 VI
26 VT
117 WA
109 WI
73 WV
10 WY
Here is some output when running uniq first, then sorting:
3 CA
3 CA
3 CA
3 CA
3 CO
3 CO
3 CO
3 CT
3 CT
3 CT
3 FL
3 IL
3 IL
3 IL
3 IL
3 IL
3 KY
3 MA
3 MA
3 MI
3 MI
3 MI
3 MO
3 MO
3 MO
3 MO
3 NC
3 NJ
3 NJ
3 NJ
3 NY
3 NY
3 NY
3 NY
3 OH
3 OH
3 OH
3 OH
3 OH
3 PA
3 PA
3 PA
3 PR
3 SC
3 TN
3 TN
3 TX
3 TX
3 TX
3 TX
3 TX
3 TX
3 TX
3 TX
3 TX
3 TX
3 UT
3 UT
3 VA
3 VA
3 WA
3 WA
3 WA
3 WI
3 WI
3 WV
4 AZ
4 CA
4 CA
4 CA
4 CA
4 FL
4 IL
4 IN
4 KS
4 MA
4 MD
4 MI
4 MS
4 NY
4 NY
4 PR
4 TX
4 TX
4 TX
4 UT
4 WI
5 AL
5 AR
5 CA
5 CO
5 FL
5 FL
5 FL
5 MO
5 NY
5 OK
5 PA
5 PR
5 TX
6 AK
6 CA
6 CT
6 FL
6 IL
6 NC
6 OH
6 OK
6 PA
6 PR
6 TX
6 TX
6 VA
7 FL
7 IL
7 NY
7 OH
7 TX
7 TX
7 TX
8 CA
8 CA
8 CA
8 FL
8 FL
8 GA
8 OH
8 PA
9 CA
9 CA
9 DE
9 FL
9 FL
9 IN
9 MO
10 OK
10 VA
10 WY
11 MO
11 NV
12 AZ
12 DC
14 CA
14 CA
14 HI
14 NY
14 PA
14 RI
15 ID
15 MN
16 MO
19 IN
21 VT
22 CA
22 FL
22 MI
23 UT
24 CA
24 IN
24 MT
25 ND
25 OH
26 IA
27 SD
29 KS
29 ME
30 KS
31 NH
32 NM
37 NE
38 AZ
39 MS
42 CT
43 WV
45 OH
49 IN
50 IA
56 OK
58 CO
59 AL
59 MD
61 AR
61 PR
62 OR
62 SC
63 PA
63 WI
64 LA
65 KY
65 WA
66 FL
67 FL
72 MO
81 NJ
82 GA
85 MN
90 VA
100 TN
106 MI
123 OH
125 MA
125 NC
169 IL
184 PA
185 TX
288 NY
301 CA
Upvotes: 1
Views: 105
Reputation: 33022
You have some non-adjacent duplicate lines in the input.
From man uniq:
Filter adjacent matching lines ...
With no options, matching lines are merged to the first occurrence.
...
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, from info uniq:
By default, uniq prints its input lines, except that it discards all but the first of adjacent repeated lines, so that no output lines are repeated. Optionally, it can instead discard lines that are not repeated, or all repeated lines. The input need not be sorted, but repeated input lines are detected only if they are adjacent. If you want to discard non-adjacent duplicate lines, perhaps you want to use sort -u.
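To see the adjacency rule in action, here is a minimal sketch on made-up input (the five state codes and the variable names are assumptions for illustration, not taken from scorecard.csv). The awk step only strips the padding that uniq -c puts in front of each count.

```shell
# Hypothetical sample input: CA appears three times, but not all adjacent.
sample=$(printf '%s\n' CA CA TX CA TX)

# uniq alone only collapses adjacent runs, so CA and TX each show up twice.
unsorted_counts=$(printf '%s\n' "$sample" | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$unsorted_counts"

# Sorting first makes identical lines adjacent, so each value gets one total.
sorted_counts=$(printf '%s\n' "$sample" | sort | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$sorted_counts"
```

The first pipeline reports CA as a run of 2 and then again as a run of 1, which is the same pattern as the repeated CA, TX and NY lines in the question.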
Upvotes: 1
Reputation: 6058
Adding to what @wjandrea said, sort -n sorts numerically rather than alphabetically, so the -n in sort -n | uniq -c is meaningless, because the input to sort -n doesn't contain the counts yet.
I suspect what you want is
cat scorecard.csv | cut -d , -f6 | sort | uniq -c | sort -n
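As a rough check of that pipeline, here is a sketch run against a tiny stand-in for the file (the six CSV rows and the variable names are invented for illustration; only the sixth field matters). The trailing awk just removes the padding that uniq -c adds.

```shell
# Hypothetical stand-in for scorecard.csv: the sixth field holds the state code.
csv=$(printf '%s\n' \
  'a,b,c,d,e,TX' \
  'a,b,c,d,e,CA' \
  'a,b,c,d,e,TX' \
  'a,b,c,d,e,AK' \
  'a,b,c,d,e,TX' \
  'a,b,c,d,e,CA')

# Group identical codes with sort, count them with uniq -c,
# then order the totals numerically with sort -n.
totals=$(printf '%s\n' "$csv" | cut -d , -f6 | sort | uniq -c | sort -n | awk '{print $1, $2}')
printf '%s\n' "$totals"
```

Each code appears once with its full total, ordered from least to most frequent. The real file seems to have a header row (the 1 STABBR line in the question's output); if that is unwanted, putting tail -n +2 before cut should skip it.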
Upvotes: 2