awk length is counting +1

Question

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length. Here is my code:

$ awk '{print length}' dico.txt | sort -nr | uniq -c

Here is the output:

My problem is that awk's length counts one extra character for each word in my file. The right output should have been:

I checked my file and it does not contain any space after the word:

ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...

So I guess awk is counting the newline as a character, even though it is not supposed to. Is there a solution, or am I doing something wrong?

Ondrej K. · Accepted Answer

I'm gonna venture a guess: your awk expects "U*X" style newlines (LF), but your dico.txt has Windows style ones (CR+LF). awk splits lines on the LF only, so the CR stays at the end of each line and gets counted. That would easily give you the +1 on all lengths.
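One way to check that guess is to look at the raw bytes of a line. The sketch below builds a small sample file with CR+LF endings (the name sample_crlf.txt is made up for illustration) and dumps its first line with od -c. If a \r shows up right before the \n, the file has Windows line endings.

```shell
# Hypothetical sample file with Windows (CR+LF) line endings.
printf 'ABAISSA\r\nABAISSABLE\r\n' > sample_crlf.txt

# od -c prints every byte; look for \r immediately before \n.
head -n 1 sample_crlf.txt | od -c

# Count how many lines end in a carriage return (0 means plain LF endings).
awk '/\r$/ { n++ } END { print n + 0 }' sample_crlf.txt
```

Running the same two commands against dico.txt instead of the sample should tell you which kind of newlines it really has.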

I took your four words:

$ cat dico.txt 
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI

And ran your line:

$ awk '{print length}' dico.txt | sort -nr | uniq -c
      1 11
      1 10
      1 8
      1 7

So far so good. Now the same, but with dico.txt converted to Windows newlines:

$ cat dico.txt  | todos > dico_win.txt 
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
      1 12
      1 11
      1 9
      1 8
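If the file does have CR+LF endings, one possible fix is to strip the trailing carriage return inside awk before measuring the line. This is only a sketch: the file name dico_crlf.txt is a stand-in created here for the demonstration, and it assumes the only stray character is a single CR at the end of each line.

```shell
# Hypothetical sample file with Windows (CR+LF) line endings.
printf 'ABAISSA\r\nABAISSABLE\r\nABAISSABLES\r\nABAISSAI\r\n' > dico_crlf.txt

# Remove a trailing CR from each line, then print the length of what is left.
awk '{ sub(/\r$/, ""); print length }' dico_crlf.txt | sort -nr | uniq -c
```

With the CR removed, the lengths should match the ones from the LF-only file above. Converting the file once, for example with tr -d '\r' or a tool such as dos2unix where it is installed, would work as well.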
