awk length is counting +1

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length. Here is my code:

$ awk '{print length}' dico.txt | sort -nr | uniq -c

Here is the output:

...
1799 5
427 4
81 3
1 2

My problem is that awk's length counts one extra character for each word in my file. The correct output should have been:

1799 4
427 3
81 2
1 1

I checked my file and it does not contain any space after the word:

ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...

So I guess awk is counting the newline as a character, even though it is not supposed to. Is there a fix, or am I doing something wrong?

Upvotes: 2

Views: 535

Answers (1)

Ondrej K.

Reputation: 9664

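I'm going to venture a guess: your awk expects Unix-style newlines (LF), but your dico.txt has Windows-style line endings (CR+LF). awk splits records on the LF only, so the CR stays at the end of each word and is counted by length. That easily gives you the +1 on all lengths.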
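One way to check this guess is to count the lines that contain a carriage return. The sketch below is only an illustration: it builds two small sample files (the names dico_win.txt and dico_unix.txt in a temporary directory are made up for the example) rather than touching your real dico.txt.

```shell
# Hypothetical sample files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'ABAISSA\r\nABAISSABLE\r\n' > "$tmpdir/dico_win.txt"   # CR+LF endings
printf 'ABAISSA\nABAISSABLE\n'     > "$tmpdir/dico_unix.txt"  # LF endings

# Hold a literal carriage return in a variable (portable across POSIX shells).
cr=$(printf '\r')

# grep -c prints the number of lines that contain a CR.
win_count=$(grep -c "$cr" "$tmpdir/dico_win.txt")
# grep exits non-zero when nothing matches, hence the "|| true".
unix_count=$(grep -c "$cr" "$tmpdir/dico_unix.txt" || true)

echo "lines with CR (CR+LF file): $win_count"
echo "lines with CR (LF file):    $unix_count"

rm -rf "$tmpdir"
```

If the same grep on your dico.txt reports a count equal to the number of lines, the file most likely has CR+LF endings.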


I took your four words:

$ cat dico.txt 
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI

And ran your line:

$ awk '{print length}' dico.txt | sort -nr | uniq -c
      1 11
      1 10
      1 8
      1 7

So far so good. Now the same, but with dico.txt converted to Windows newlines:

$ cat dico.txt  | todos > dico_win.txt 
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
      1 12
      1 11
      1 9
      1 8

Upvotes: 5
