Reputation: 4097

Linux: counting spaces and other characters in file

Problem:

I need to produce output in the exact format that a mailing machine's software expects. I can count newlines, carriage returns, tabs, etc. using tools like

cat -vte

and

od -c

and

wc -l ( or wc -c )

However, I'd also like to know the exact number of spaces and tabs at the start and end of each line, and between sections of text.

Question:

How would you go about analyzing and then matching a template exactly using common Unix tools plus Perl or Python? One-liners preferred. Also, what's your advice for handling a DOS-encoded file? Would you convert it to *nix line endings first and then analyze it, or leave it as is?
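On the DOS question, a minimal Python sketch that counts each kind of line ending without converting the file first. The function name count_line_endings and the sample bytes are made up for illustration; the point is only that working on raw bytes keeps every \r visible, including the bare ones in the middle of a line.

```python
import re


def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LFs and lone CRs in raw bytes.

    Hypothetical helper for illustration; not part of any tool named above.
    """
    return {
        "crlf": len(re.findall(rb"\r\n", data)),
        "lf": len(re.findall(rb"(?<!\r)\n", data)),   # LF not preceded by CR
        "cr": len(re.findall(rb"\r(?!\n)", data)),    # CR not followed by LF
    }


# Invented sample that mixes all three kinds, like the SAMPLE TEXT does.
sample = b"\r\n\n Institution\r\n e__\r D_ \r _O\r\n"
print(count_line_endings(sample))
```

For a real file, read it with open(path, "rb") so that Python does not translate the line endings before they are counted.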

UPDATE

Using this to see individual spaces [ assumes no '%' chars in file ]:

sed 's/ /%/g' filename.000

Plan to build a script that analyzes each line's tab and space content.
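A rough Python sketch of what such a script could look like, assuming that "tab and space content" means the spaces and tabs at the start and end of each line plus the line terminator. The name whitespace_profile and the sample bytes are invented for this example.

```python
import io
import re


def whitespace_profile(line: bytes) -> dict:
    """Describe the whitespace at both ends of one raw line (hypothetical helper)."""
    body = line.rstrip(b"\r\n")      # drop only the terminator characters
    ending = line[len(body):]        # whatever was dropped: b"\r\n", b"\n", b""
    leading = re.match(rb"[ \t]*", body).group()
    trailing = re.search(rb"[ \t]*$", body).group()
    # Note: on a line made only of spaces/tabs, leading and trailing overlap.
    return {
        "leading_spaces": leading.count(b" "),
        "leading_tabs": leading.count(b"\t"),
        "trailing_spaces": trailing.count(b" "),
        "trailing_tabs": trailing.count(b"\t"),
        "ending": ending,
    }


# Invented sample; iterating a binary stream splits on \n only, so a bare \r
# inside a line stays part of that line.
sample = b" Institution Anon LLC\r\n\n\t 10/27/2011  \r\n"
for raw_line in io.BytesIO(sample):
    print(whitespace_profile(raw_line))
```

Iterating over a file opened with open(path, "rb") behaves the same way as the io.BytesIO stand-in used here.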

Using @shiplu's solution with a nod to the anti-cat crowd:

while read l;do echo "$l";echo $((`echo "$l" |  wc -c` - `echo "$l" | tr -d ' ' | wc -c`));done<filename.000

Still needs some tweaks for Windows, but it's well on its way.

SAMPLE TEXT

Key for reading:

newlines marked with \n

Carriage returns marked with \r

Unknown space/tab characters marked with [:space:] ( need counts on those )

\r\n
\n
[:space:]Institution Anon LLC\r\n
[:space:]123 Blankety St\r\n
[:space:]Greater Abyss, AK  99999\r\n
\n
\n
[:space:]                                10/27/2011\r\n
[:space:]Requested materials are available for pickup:\r\n
[:space:]e__\r[:space:]                     D_ \r[:space:]   _O\r\n
[:space:]Bathtime for BonZo[:space:]       45454545454545[:space:]  10/27/2011\r\n
[:space:]Bathtime for BonZo[:space:]       45454545454545[:space:]  10/27/2011\r\n
\n
\n
\n
\n
\n
\n
[:space:]                             Pantz McManliss\r\n
[:space:]                             Gibberish Ave\r\n
[:space:]                             Northern Mirkwood, ME  99999\r\n
( untold variable amounts of \n chars go here )
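For the "matching a template exactly" part, one possible approach is to compare a generated file against a known-good template byte for byte and report the first line that differs. This is only a sketch: the name first_mismatch and the sample data are made up, and it assumes both files fit in memory.

```python
from itertools import zip_longest


def first_mismatch(template: bytes, candidate: bytes):
    """Return (line_number, template_line, candidate_line) for the first
    differing line, or None when both inputs are identical.

    Hypothetical helper. Lines are split on \\n only, so a \\r stays attached
    to its line and a CRLF-versus-LF difference shows up in the result.
    """
    pairs = zip_longest(template.split(b"\n"), candidate.split(b"\n"))
    for number, (expected, actual) in enumerate(pairs, start=1):
        if expected != actual:       # None means one side ran out of lines
            return number, expected, actual
    return None


# Invented data: the candidate lost the carriage return on its second line.
template = b" Institution Anon LLC\r\n 123 Blankety St\r\n"
candidate = b" Institution Anon LLC\r\n 123 Blankety St\n"
print(first_mismatch(template, candidate))
```

Printing the returned bytes objects shows escapes such as \r and \t literally, which makes invisible differences easy to spot.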

UPDATE 2

Setting IFS to an empty string for read gives results similar to the Ruby one-liner posted below.

while IFS='' read -r line
 do 
     printf "%s\n" "$line" | sed 's/ /%/g' | grep -o '%' | wc -w
 done < filename.000

Upvotes: 3

Answers (7)

inger

Reputation: 20194

In case Ruby counts (it does count :)

ruby -lne 'puts $_.scan(/\s/).size'

and now some Perl (slightly less intuitive IMHO):

perl -lne 'print scalar(@{[/(\s)/g]})'

Upvotes: 1

TLP

Reputation: 67890

perl -nwE 'print; for my $s (/([\t ]+)/g) { say "Count: ", length $s }' input.txt

This will count individual groups of tab or space, instead of counting all the whitespace in the entire line. For example:

    foo        bar

Will print

    foo        bar
Count: 4
Count: 8

You may wish to skip single spaces (the spaces between words), that is, not count the spaces in "Bathtime for BonZo". If so, replace + with {2,} or whatever minimum you think is appropriate.
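For readers who prefer Python, a rough equivalent of the run-counting idea, with the minimum run length as a parameter. The function name whitespace_runs is made up, and reporting the start column alongside the length is an addition that the Perl one-liner does not make.

```python
import re


def whitespace_runs(line: str, minimum: int = 1):
    """Return (start_column, length) for every run of spaces/tabs that is
    at least `minimum` characters long (hypothetical helper)."""
    pattern = re.compile(r"[ \t]{%d,}" % minimum)
    return [(m.start(), len(m.group())) for m in pattern.finditer(line)]


print(whitespace_runs("    foo        bar"))
# minimum=2 skips the single spaces between words
print(whitespace_runs("Bathtime for BonZo       45454545454545", minimum=2))
```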

Upvotes: 2

ikegami

Reputation: 385976

perl -nlE'say 0+( () = /\s/g );'

Unlike the currently accepted answer, this doesn't split the input into fields, discarding the result. It also doesn't needlessly create an array just to count the number of values in a list.

Idioms used:

0+( ... ) imposes scalar context like scalar( ... ), but it's clearer because it tells the reader a number is expected.
List assignment in scalar context returns the number of elements returned by its RHS, so 0+( () = /.../g ) gives the number of times /.../g matched.
-l, when used with -n, will cause the input to be "chomped", so this removes line feeds from the count.

If you're just interested in spaces (U+0020) and tabs (U+0009), the following is faster and simpler:

perl -nE'say tr/ \t//;'

In both cases, you can pass the input via STDIN or via a file named by an argument.
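One thing to watch with DOS files: chomping removes only the line feed, so a trailing \r is still present and \s counts it, while the space-and-tab count does not. A small Python sketch of that difference follows; the function names and the sample line are invented, and Python's \s on str is a little broader than Perl's default, so treat this as an approximation.

```python
import re


def count_whitespace(line: str) -> int:
    """Count every whitespace character, roughly like /\\s/g (hypothetical)."""
    return len(re.findall(r"\s", line))


def count_spaces_and_tabs(line: str) -> int:
    """Count only U+0020 and U+0009, like tr/ \\t// (hypothetical)."""
    return line.count(" ") + line.count("\t")


raw = " 123 Blankety St\r\n"
# Mimic chomping: remove one trailing line feed, keep the carriage return.
chomped = raw[:-1] if raw.endswith("\n") else raw

print(count_whitespace(chomped), count_spaces_and_tabs(chomped))
```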

Upvotes: 5

Shiplu Mokaddim

Reputation: 57660

If you want to count the number of spaces in pm.txt, this command will do,

 cat pm.txt | while read l; 
 do echo $((`echo "$l" |  wc -c` - `echo "$l" | tr -d ' ' | wc -c`));
 done;

If you want to count the number of spaces, \r, \n, \t use this,

cat pm.txt | while read l;
do echo $((`echo "$l" |  wc -c` - `echo "$l" | tr -d ' \r\n\t' | wc -c`));
done;

read strips leading and trailing whitespace from each line. If you don't want that, there is a nasty way. First split your file so that there is only one line per file, using

split -l 1 -d pm.txt

After that there will be bunch of x* files. Now loop through it.

for x in x*; do echo $((`cat $x |  wc -c` - `cat $x | tr -d ' \r\n\t' | wc -c`)); done;

Remove those files afterwards with rm x*;

Upvotes: 1

user unknown

Reputation: 36229

counting blanks:

sed 's/[^ ]//g' FILE | tr -d "\n" | wc -c

This counts blanks before, after, and between text. Do you want to count newlines, tabs, etc. in the same go and sum them up, or as a separate step?
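If separate totals per character turn out to be what is wanted, one pass can produce them all. A small Python sketch, with the names TRACKED and tally invented for this example:

```python
from collections import Counter

# Characters to report and the label used for each (illustrative choice).
TRACKED = {" ": "space", "\t": "tab", "\r": "CR", "\n": "LF"}


def tally(text: str) -> Counter:
    """Count each tracked whitespace character separately (hypothetical helper)."""
    return Counter(TRACKED[ch] for ch in text if ch in TRACKED)


print(tally(" Greater Abyss, AK  99999\r\n\n"))
```

When reading a real file as text, pass newline="" to open() so that Python leaves \r\n untranslated; otherwise the CR count will always be zero.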

Upvotes: 2

Pete Wilson

Reputation: 8694

If you ask me, I'd write a simple C program to do the counting and formatting all in one go. But that's just me. By the time I got finished fiddle-farting around with perl, schmerl, zwerl I'd have wasted half a day.

Upvotes: -1

Jonathon Reinhart

Reputation: 137448

Regular expressions in Perl or Python would be the way to go here.

Yes, it may take an initial time investment to learn "perl, schmerl, zwerl" but once you've gained experience with an extremely powerful tool like Regular Expressions, it can save you an enormous amount of time down the road.


Upvotes: 4
