Reputation: 17

What is the difference between sorting first and then using uniq vs. vice versa in Bash?

cat scorecard.csv|cut -d , -f6|sort -n|uniq -c

gives me a count for each value with no repeats, while

cat scorecard.csv|cut -d , -f6|uniq -c|sort -n

gives me counts, but the same value appears more than once and the counts are not accurate. Why is this, when the two commands are so similar?

Here is some output for sort first, then uniq:

  9 AK
 94 AL
 89 AR
  1 AS
122 AZ
714 CA
113 CO
 81 CT
 24 DC
 20 DE
409 FL
  1 FM
174 GA
  3 GU
 24 HI
 88 IA
 36 ID
275 IL
151 IN
 84 KS
100 KY
130 LA
178 MA
 91 MD
 40 ME
  1 MH
194 MI
124 MN
179 MO
  1 MP
 61 MS
 33 MT
187 NC
 29 ND
 49 NE
 40 NH
160 NJ
 48 NM
 41 NV
449 NY
313 OH
127 OK
 86 OR
377 PA
137 PR
  1 PW
 24 RI
108 SC
 30 SD
  1 STABBR
176 TN
443 TX
 75 UT
177 VA
  2 VI
 26 VT
117 WA
109 WI
 73 WV
 10 WY

Here is some output for uniq first, then sort:

  3 CA
  3 CA
  3 CA
  3 CA
  3 CO
  3 CO
  3 CO
  3 CT
  3 CT
  3 CT
  3 FL
  3 IL
  3 IL
  3 IL
  3 IL
  3 IL
  3 KY
  3 MA
  3 MA
  3 MI
  3 MI
  3 MI
  3 MO
  3 MO
  3 MO
  3 MO
  3 NC
  3 NJ
  3 NJ
  3 NJ
  3 NY
  3 NY
  3 NY
  3 NY
  3 OH
  3 OH
  3 OH
  3 OH
  3 OH
  3 PA
  3 PA
  3 PA
  3 PR
  3 SC
  3 TN
  3 TN
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 UT
  3 UT
  3 VA
  3 VA
  3 WA
  3 WA
  3 WA
  3 WI
  3 WI
  3 WV
  4 AZ
  4 CA
  4 CA
  4 CA
  4 CA
  4 FL
  4 IL
  4 IN
  4 KS
  4 MA
  4 MD
  4 MI
  4 MS
  4 NY
  4 NY
  4 PR
  4 TX
  4 TX
  4 TX
  4 UT
  4 WI
  5 AL
  5 AR
  5 CA
  5 CO
  5 FL
  5 FL
  5 FL
  5 MO
  5 NY
  5 OK
  5 PA
  5 PR
  5 TX
  6 AK
  6 CA
  6 CT
  6 FL
  6 IL
  6 NC
  6 OH
  6 OK
  6 PA
  6 PR
  6 TX
  6 TX
  6 VA
  7 FL
  7 IL
  7 NY
  7 OH
  7 TX
  7 TX
  7 TX
  8 CA
  8 CA
  8 CA
  8 FL
  8 FL
  8 GA
  8 OH
  8 PA
  9 CA
  9 CA
  9 DE
  9 FL
  9 FL
  9 IN
  9 MO
 10 OK
 10 VA
 10 WY
 11 MO
 11 NV
 12 AZ
 12 DC
 14 CA
 14 CA
 14 HI
 14 NY
 14 PA
 14 RI
 15 ID
 15 MN
 16 MO
 19 IN
 21 VT
 22 CA
 22 FL
 22 MI
 23 UT
 24 CA
 24 IN
 24 MT
 25 ND
 25 OH
 26 IA
 27 SD
 29 KS
 29 ME
 30 KS
 31 NH
 32 NM
 37 NE
 38 AZ
 39 MS
 42 CT
 43 WV
 45 OH
 49 IN
 50 IA
 56 OK
 58 CO
 59 AL
 59 MD
 61 AR
 61 PR
 62 OR
 62 SC
 63 PA
 63 WI
 64 LA
 65 KY
 65 WA
 66 FL
 67 FL
 72 MO
 81 NJ
 82 GA
 85 MN
 90 VA
100 TN
106 MI
123 OH
125 MA
125 NC
169 IL
184 PA
185 TX
288 NY
301 CA

Upvotes: 1

Answers (2)

wjandrea

Reputation: 33022

You have some non-adjacent duplicate lines in the input.

From man uniq:

Filter adjacent matching lines ...

With no options, matching lines are merged to the first occurrence.

...

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.

Also from info uniq:

By default, uniq prints its input lines, except that it discards all but the first of adjacent repeated lines, so that no output lines are repeated. Optionally, it can instead discard lines that are not repeated, or all repeated lines.

The input need not be sorted, but repeated input lines are detected only if they are adjacent. If you want to discard non-adjacent duplicate lines, perhaps you want to use sort -u.
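
To see the adjacency rule in isolation, here is a minimal sketch. The six-line input is made up for illustration (it is not taken from scorecard.csv), and the variable names are arbitrary.

```shell
# Hypothetical sample input: two state codes whose repeats are not all adjacent.
input=$(printf '%s\n' CA CA TX CA TX TX)

# uniq -c on unsorted input: only adjacent runs are merged,
# so CA and TX each show up more than once with partial counts.
unsorted_counts=$(printf '%s\n' "$input" | uniq -c)

# sort first: identical lines become adjacent, so uniq -c sees one run per value.
sorted_counts=$(printf '%s\n' "$input" | sort | uniq -c)

printf 'uniq -c only:\n%s\n\n' "$unsorted_counts"
printf 'sort | uniq -c:\n%s\n' "$sorted_counts"
```

The first result should have four lines (runs of 2 CA, 1 TX, 1 CA, 2 TX), while the second should have one line per value (3 CA, 3 TX). The exact column padding of uniq -c differs between implementations.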

Upvotes: 1

root

Reputation: 6058

Adding to what @wjandrea said: sort -n sorts numerically rather than alphabetically, so the -n in sort -n | uniq -c is pointless, because the input to sort -n doesn't contain any numbers yet. The counts only exist after uniq -c has run, so that is where a numeric sort belongs.

I suspect what you want is

cat scorecard.csv | cut -d , -f6 | sort | uniq -c | sort -n
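
A small self-contained run of that pipeline, as a sketch: the CSV below is a made-up stand-in for scorecard.csv (only the STABBR header name comes from the question's output; the other column names and rows are hypothetical). It also reads the file directly instead of going through cat, and uses tail -n +2 to skip the header row, which is an optional extra and not part of the command above.

```shell
# Hypothetical stand-in for scorecard.csv: six comma-separated columns,
# with the state code in column 6 and a header row on line 1.
csv=$(mktemp)
cat > "$csv" <<'EOF'
ID,NAME,CITY,URL,ZIP,STABBR
1,a,b,c,d,CA
2,a,b,c,d,TX
3,a,b,c,d,CA
4,a,b,c,d,AK
5,a,b,c,d,CA
6,a,b,c,d,TX
EOF

# tail -n +2 drops the header, the first sort groups equal codes together,
# uniq -c counts each group, and the final sort -n orders the lines by count.
counts=$(tail -n +2 "$csv" | cut -d , -f6 | sort | uniq -c | sort -n)
printf '%s\n' "$counts"

rm -f "$csv"
```

With this sample data the result should list AK once, TX twice, and CA three times, in ascending order of count. Without the tail step, the header would be counted as a value too, which is where the "1 STABBR" line in the question's output comes from.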

Upvotes: 2
