Reputation: 4097
Problem:
I need to match the exact format that a mailing machine's software expects. I can count the number of newlines, carriage returns, tabs, etc. using tools like cat -vte, od -c, and wc -l (or wc -c).
However, I'd like to know the exact number of leading and trailing spaces between characters and sections of text. Tabs as well.
Question:
How would you go about analyzing and then exactly matching a template using common Unix tools plus Perl or Python? One-liners preferred. Also, what's your advice for matching a DOS-encoded file? Would you convert it to Unix line endings first and then analyze it, or leave it as is?
UPDATE
Using this to see individual spaces [ assumes no '%' chars in file ]:
sed 's/ /%/g' filename.000
Plan to build a script that analyzes each line's tab and space content.
Using @shiplu's solution with a nod to the anti-cat crowd:
while read l;do echo $l;echo $((`echo $l | wc -c` - `echo $l | tr -d ' ' | wc -c`));done<filename.000
Still needs some tweaks for Windows but it's well on its way.
SAMPLE TEXT
Key for reading:
newlines marked with \n
Carriage returns marked with \r
Unknown space/tab characters marked with [:space:] ( need counts on those )
\r\n
\n
[:space:]Institution Anon LLC\r\n
[:space:]123 Blankety St\r\n
[:space:]Greater Abyss, AK 99999\r\n
\n
\n
[:space:] 10/27/2011\r\n
[:space:]Requested materials are available for pickup:\r\n
[:space:]e__\r[:space:] D_ \r[:space:] _O\r\n
[:space:]Bathtime for BonZo[:space:] 45454545454545[:space:] 10/27/2011\r\n
[:space:]Bathtime for BonZo[:space:] 45454545454545[:space:] 10/27/2011\r\n
\n
\n
\n
\n
\n
\n
[:space:] Pantz McManliss\r\n
[:space:] Gibberish Ave\r\n
[:space:] Northern Mirkwood, ME 99999\r\n
( untold variable amounts of \n chars go here )
UPDATE 2
Using IFS with read gives similar results to the ruby posted by someone below.
while IFS='' read -r line
do
printf "%s\n" "$line" | sed 's/ /%/g' | grep -o '%' | wc -w
done < filename.000
Upvotes: 3
Views: 6873
Reputation: 20194
In case Ruby counts (it does count :)
ruby -lne 'puts scan(/\s/).size'
and now some Perl (slightly less intuitive IMHO):
perl -lne 'print scalar(@{[/(\s)/g]})'
Upvotes: 1
Reputation: 67890
perl -nwE 'print; for my $s (/([\t ]+)/g) { say "Count: ", length $s }' input.txt
This will count individual groups of tabs or spaces, instead of counting all the whitespace in the entire line. For example, a line with four leading spaces and eight spaces between the words:
    foo        bar
Will print
    foo        bar
Count: 4
Count: 8
You may wish to skip single spaces (spaces between words), i.e. not count the spaces in Bathtime for BonZo. If so, replace + with {2,} or whatever minimum you think is appropriate.
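For anyone who prefers Python, the same idea can be sketched roughly as below. This is only an illustration, not part of the Perl answer: the function name whitespace_runs and its min_len parameter are made up here, and min_len plays the role of replacing + with {2,}.

```python
import re

def whitespace_runs(line, min_len=1):
    """Return (start, length, text) for every run of spaces/tabs in line."""
    # {n,} mirrors the Perl suggestion of swapping + for {2,} to skip short runs.
    pattern = re.compile(r"[ \t]{%d,}" % min_len)
    return [(m.start(), len(m.group()), m.group()) for m in pattern.finditer(line)]

if __name__ == "__main__":
    sample = "    foo        bar"  # 4 leading spaces, 8 between the words
    for start, length, text in whitespace_runs(sample):
        print(f"column {start}: {length} chars {text!r}")
```

Reporting the start column as well as the length makes it easier to see which gap in the template a count belongs to.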
Upvotes: 2
Reputation: 385976
perl -nlE'say 0+( () = /\s/g );'
Unlike the currently accepted answer, this doesn't split the input into fields, discarding the result. It also doesn't needlessly create an array just to count the number of values in a list.
Idioms used:
0+( ... ) imposes scalar context like scalar( ... ), but it's clearer because it tells the reader a number is expected.
0+( () = /.../g ) gives the number of times () = /.../g matched.
-l, when used with -n, will cause the input to be "chomped", so this removes line feeds from the count.
If you're just interested in spaces (U+0020) and tabs (U+0009), the following is faster and simpler:
perl -nE'say tr/ \t//;'
In both cases, you can pass the input via STDIN or via a file named by an argument.
Upvotes: 5
Reputation: 57660
If you want to count the number of spaces in pm.txt, this command will do:
cat pm.txt | while read l;
do echo $((`echo $l | wc -c` - `echo $l | tr -d ' ' | wc -c`));
done;
If you want to count the number of spaces, \r, \n, and \t characters, use this:
cat pm.txt | while read l;
do echo $((`echo $l | wc -c` - `echo $l | tr -d ' \r\n\t' | wc -c`));
done;
read will strip any leading whitespace. If you don't want that, there is a nasty way. First split your file so that each piece contains exactly one line, using
`split -l 1 -d pm.txt`.
After that there will be a bunch of x* files. Now loop through them.
for x in x*; do echo $((`cat $x | wc -c` - `cat $x | tr -d ' \r\n\t' | wc -c`)); done;
Remove those files afterwards with rm x*.
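If Python is an option, reading the bytes directly avoids both the stripping done by read and the temporary files. The following is a rough sketch under that assumption, not a drop-in replacement for the loops above; the helper names lines_keeping_ends and count_whitespace are invented for this example. Lines are split on \n only, so a bare \r inside a line (as in the sample text) stays part of that line.

```python
def lines_keeping_ends(data):
    """Split bytes on LF only, keeping each terminator attached to its line."""
    parts = data.split(b"\n")
    lines = [part + b"\n" for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # final line that has no trailing LF
    return lines

def count_whitespace(data):
    """Return one dict of space/tab/CR/LF counts per physical line."""
    counts = []
    for line in lines_keeping_ends(data):
        counts.append({
            "space": line.count(b" "),
            "tab": line.count(b"\t"),
            "cr": line.count(b"\r"),
            "lf": line.count(b"\n"),
        })
    return counts

if __name__ == "__main__":
    # For a real file, read it in binary mode: open("pm.txt", "rb").read()
    sample = b"  a\tb \r\n\n x\ry\r\n"
    for number, row in enumerate(count_whitespace(sample), start=1):
        print(number, row)
```

Binary mode matters here: text mode would translate \r\n to \n by default, which would hide exactly the characters being counted.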
Upvotes: 1
Reputation: 36229
counting blanks:
sed 's/[^ ]//g' FILE | tr -d "\n" | wc -c
This counts blanks before, after, and between text. Do you want to count newlines, tabs, etc. in the same go and sum them up, or as a separate step?
Upvotes: 2
Reputation: 8694
If you ask me, I'd write a simple C program to do the counting and formatting all in one go. But that's just me. By the time I got finished fiddle-farting around with perl, schmerl, zwerl I'd have wasted half a day.
Upvotes: -1
Reputation: 137448
Regular expressions in Perl or Python would be the way to go here.
Yes, it may take an initial time investment to learn "perl, schmerl, zwerl" but once you've gained experience with an extremely powerful tool like Regular Expressions, it can save you an enormous amount of time down the road.
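As one possible illustration in Python, a regular expression can turn every run of whitespace into a visible tag, which makes a template and a generated file easy to compare by eye. This is a sketch under assumptions: the tag format and the names show_whitespace and first_difference are invented for the example.

```python
import re

NAMES = {" ": "SP", "\t": "TAB", "\r": "CR", "\n": "LF"}

def show_whitespace(text):
    """Replace each run of a single whitespace character with a tag like <SP*4>."""
    def tag(match):
        run = match.group()
        return "<%s*%d>" % (NAMES[run[0]], len(run))
    return re.sub(r" +|\t+|\r+|\n+", tag, text)

def first_difference(template, candidate):
    """Return the 1-based number of the first differing line, or None if equal."""
    t_lines = template.split("\n")
    c_lines = candidate.split("\n")
    for number, (t, c) in enumerate(zip(t_lines, c_lines), start=1):
        if t != c:
            return number
    if len(t_lines) != len(c_lines):
        return min(len(t_lines), len(c_lines)) + 1
    return None

if __name__ == "__main__":
    # For real files, open(path, newline="") keeps \r\n untranslated in text mode.
    template = " Institution Anon LLC\r\n  10/27/2011\r\n"
    candidate = " Institution Anon LLC\r\n 10/27/2011\r\n"
    line_no = first_difference(template, candidate)
    print("first differing line:", line_no)
    print(show_whitespace(template.split("\n")[line_no - 1]))
    print(show_whitespace(candidate.split("\n")[line_no - 1]))
```

Leaving the DOS line endings in place and tagging them, rather than converting the file first, keeps the comparison honest about what the mailing software will actually receive.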
Upvotes: 4