vishalkin

Reputation: 1235

Line breaks are not counted in awk character count

I have the following code that counts the number of characters in a file using awk, but it doesn't count the line breaks the way wc does.

File abc:

12345
12345
12345
12345
12345

awk command:

$ awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)c++}END{print "total chars:"c}' abc


This gives me the output

total chars:25

but if I run wc on the same abc file, it reports 30 characters.
Any suggestions? Can I use two field separators at a time?

Upvotes: 1

Views: 1985

Answers (3)

Scrutinizer

Reputation: 9936

As I noted in this thread: "Multiple Field separator in awk script", awk can only give a correct result for proper text files, where limits such as maximum line length are observed and the last line ends with a newline, whereas wc does not have this limitation. For such files, add one newline per record (NR) to the summed line lengths:

awk '{t+=length} END{print "Total chars: " NR+t}' file

wc does not care and will just count the characters.
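
As a quick check of the arithmetic, the sample file can be recreated and both counts compared. This is only a sketch: the temporary file and the variable names are made up for illustration, and it assumes the file holds five newline-terminated lines of ASCII text.

```shell
# Recreate the sample data from the question in a temporary file
# (assumption: five lines of "12345", each ending in a newline).
tmp=$(mktemp)
printf '12345\n12345\n12345\n12345\n12345\n' > "$tmp"

# Sum of the line lengths (newlines excluded) plus one newline per record.
awk_total=$(awk '{t+=length} END{print NR+t}' "$tmp")

# wc -c counts bytes; for ASCII-only input that equals the character count.
# tr strips the padding some wc implementations print.
wc_total=$(wc -c < "$tmp" | tr -d ' ')

echo "awk: $awk_total, wc: $wc_total"
rm -f "$tmp"
```

Both values should be 30 here: 25 visible characters plus 5 newlines.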

=== edit === This might work:

awk '
  NR==FNR{
    m++
    next
  }
  {
    t+=length
  }
  m==FNR+1{
    RS="§"
  }
  END{
    print "Total chars: " FNR+t-1
  }
' file file

or in one line:

awk 'NR==FNR{ m++; next } { t+=length } m==FNR+1{ RS="§" } END{ print "Total chars: " FNR+t-1 }' file file

The file is read twice: the first pass determines the number of lines, and on the second pass the record separator is changed after the second-to-last line, so that the last record keeps its trailing newline (if there is one).

Upvotes: 3

Ed Morton

Reputation: 204099

This is based on @Scrutinizer's solution to show one way to handle files that might not end in a newline (using GNU awk for RT) to address @konsolebox's concern:

gawk '{t+=length+(RT?1:0)} END{print t}' file

or, more efficiently, as @konsolebox pointed out:

gawk '{t+=length} END{print t+NR-(RT?0:1)}' file

To accommodate empty files:

gawk '{t+=length}END{print t+NR-(!RT&&NR?1:0)}' file
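
To see why a missing final newline needs special handling, here is a small sketch using plain awk and a hypothetical temporary file whose last line is not newline-terminated. It only demonstrates the mismatch and is not one of the solutions themselves.

```shell
# Hypothetical test file: two lines, the second WITHOUT a trailing newline.
tmp=$(mktemp)
printf '12345\n12345' > "$tmp"

# Line lengths plus one newline per record: this assumes every record
# was terminated by a newline, which is not true for the last one here.
naive=$(awk '{t+=length} END{print t+NR}' "$tmp")

# Actual byte count (equals the character count for ASCII-only input).
actual=$(wc -c < "$tmp" | tr -d ' ')

echo "naive awk: $naive, wc -c: $actual"
rm -f "$tmp"
```

The naive count comes out one higher than wc -c (12 versus 11), which is the extra newline that the RT test subtracts.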

Upvotes: 5

konsolebox

Reputation: 75558

Your records are still separated by RS (a newline by default), so the 5 newlines are excluded from the count.

Use another delimiter for both FS and RS, and take the length of the whole $0 instead:

awk 'BEGIN{FS=RS="\x1c"}{c+=length($0)}END{print "total chars:"c}' abc

Output:

total chars:30

Note that using "" or "\x00" would make it skip the last character.

Conceptually, it is the same as:

awk 'BEGIN{FS=RS="\x1c"}END{print "total chars:" length($0)}' abc

This assumes that the file doesn't contain any \x1c. If it does, the result would be invalid either way.

Upvotes: 2
