Reputation: 193
The program is supposed to count the number of lines that begin with a decimal number in parentheses, contain a mix of both upper- and lowercase letters, and end with a period.
I have
BEGIN {x=0}
/^\([0-9[0-9]*) [A-Z][A-z]* [a-z][a-z]* \.$/ {x = x+1}
END{print x}
I have them split on multiple different lines because I have been running display(!d) statements for debugging trying to figure it out.
To run it I use awk -f programName.awk filename.txt
Any help is appreciated.
UPDATE
New code reads
BEGIN{x=0}
/^\([0-9]+\)[A-Za-z]+\.$/{x++}
END{print x}
I use vim EC.awk to edit this. awk -f EC.awk EC.txt to run comes back with 1. EC.txt contains 5 out of 12 lines that should be counted.
INPUT FILE vim EC.txt
(1) Line one, this should count.
(2)Line two. Should also count.
3 should not count..
4 not
(5)Yes.
(6). nope
7 OHHH mann
8 This suck
(9)Oh ya? YOU SUCK.
10 Cheaa
(11) BOI.
(12) WoW MoM. Print mofo.
UPDATED CODE
BEGIN{x=0}
/^\([0-9]+\).*?[A-Za-z]+\.$/{x++}
END{print x}
This gives me 6. I believe it's counting line 11, (11) BOI. I'm working on printing out the lines to make sure.
Upvotes: 4
Views: 1624
Reputation: 8769
Your regex tries to match the following text: (1 or more digits)<space><1 or more uppercase><space><1 or more lowercase><space><period>
I think you left out the ] after the digits while posting the question. If you want an uppercase word followed by a lowercase word, then keep your regex; but since you mentioned in your question that it can be a mix of uppercase and lowercase, you will have to use [A-Za-z]+. The + ensures 1 or more, i.e. [a-z]+ is equivalent to [a-z][a-z]*
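As a quick check of that equivalence, the sketch below runs both forms over the same few lines and compares the counts. The sample lines and the variable names (sample, plus, star) are made up for illustration, and LC_ALL=C is set on the assumption that you want [a-z] to mean ASCII lowercase only.

```shell
# Made-up sample lines for illustration only (not from the question).
sample='abc
ABC
a1
123

xyz'

# Count lines made up entirely of lowercase letters, written both ways.
plus=$(printf '%s\n' "$sample" | LC_ALL=C awk '/^[a-z]+$/ {n++} END {print n+0}')
star=$(printf '%s\n' "$sample" | LC_ALL=C awk '/^[a-z][a-z]*$/ {n++} END {print n+0}')

echo "plus=$plus star=$star"
```

Both counts should come out the same, since only abc and xyz consist solely of lowercase letters.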
$cat file.txt
(1) aBCdadg .
(2) dgshdf .
(3) DFHFH .
xyz
abcd
(56) sdflgkfd .
$ cat prgm.awk
BEGIN {x=0}
/^\([0-9]+\) [A-Za-z]+ \.$/ {x++}
END {print x}
$ awk -f prgm.awk file.txt
4
$
And if you want to have 1 or more lowercase chars followed by 1 or more uppercase chars, then you will have to use this regex:
/^\([0-9]+\) [a-z]+ [A-Z]+ \.$/ {x++}
Edit:
$ cat file.txt
(1) Line one, this should count.
(2) Line two. Should also count.
3 should not count..
4 not
(5)Yes.
(6). nope
7 OHHH mann
8 This suck
(9) Oh ya? YOU SUCK.
10 Cheaa
(11) BOI.
(12) WoW MoM. Print mofo.
$ cat prgm.awk
BEGIN {x=0}
/^\([0-9]+\)\s*[A-Za-z0-9., ]+\s*\./{x++}
END {print x}
$ awk -f prgm.awk file.txt
5
$
Edit 2: Sorry, I was in a hurry to go somewhere and was away from my computer for a few hours. Since it's clearer now what you need, I'll just update the answer for completeness.
$ cat prgm.awk
BEGIN {x=0}
/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/{x++;print $0}
END {print x}
$ awk -f prgm.awk input_file.txt
(1) Line one, this should count.
(2) Line two. Should also count.
(5)Yes.
(9) Oh ya? YOU SUCK.
(12) WoW MoM. Print mofo.
5
$
Do mark the question solved by accepting anyone's answer apart from mine :P :)
Edit 3: give others the credit.
Upvotes: 1
Reputation: 440677
For an alternative solution that expresses the intent more simply and clearly and is also locale-aware (doesn't invariably only match ASCII letters), see Ed Morton's helpful answer.
Try the following (POSIX-compliant):
awk '/^\([0-9]+\).*([A-Z].*[a-z]|[a-z].*[A-Z]).*\.$/ { ++x } END { print x+0 }' file
^\([0-9]+\)
matches a decimal number in parentheses at the beginning of a line.
\.$
matches a literal period at the end of a line.
.*([A-Z].*[a-z]|[a-z].*[A-Z]).*
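matches any string in between that contains at least one uppercase letter followed, not necessarily immediately, by a lowercase letter, or vice versa.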
As for why your approach didn't work:
[A-Z][A-z] *[a-z][a-z]* only matches lines whose first [ASCII] letter is uppercase; in other words, lines where the first letter on the line is lowercase aren't matched.
[A-Za-z]+, due to using a single character set any of whose characters are matched, also matches lines containing only uppercase or only lowercase letters, which is why line (11) BOI. also matches.
Upvotes: 5
Reputation: 104111
It is sometimes best to break down the conditions into separate regexes:
/^\([0-9]+\)/ or /^\([[:digit:]]+\)/
/[A-Z]/ or /[[:upper:]]/
/[a-z]/ or /[[:lower:]]/
/\.[ \t]*$/ (the [ \t]* catches trailing spaces if any...)
Now just combine those conditions:
awk '/^\([[:digit:]]+\)/ && /\.[ \t]*$/ && /[[:lower:]]/ && /[[:upper:]]/ { print }' file
(1) Line one, this should count.
(2)Line two. Should also count.
(5)Yes.
(9)Oh ya? YOU SUCK.
(12) WoW MoM. Print mofo.
Then run through wc -l to get the line count:
awk '/^\([[:digit:]]+\)/ && /\.[ \t]*$/ && /[[:lower:]]/ && /[[:upper:]]/ { print }' file | wc -l
5
Or, maintain your own count:
awk '/^\([[:digit:]]+\)/ && /\.[ \t]*$/ && /[[:lower:]]/ && /[[:upper:]]/ { i++ } END{print i}' file
5
The issue with your regex:
/^\([0-9]+\).*?[A-Za-z]+\.$/
            ^^                  Any string of characters
               ^      ^         Could be 'UPPER' or 'lower'
.* matches all characters (including spaces) leading up to [A-Za-z]+, which matches a run of upper and/or lower case letters but does not tell you if you have both.
Almost, but with that regex you are not properly rejecting lines that fail to include both upper and lower case letters.
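To see the difference on the sample input, a sketch along these lines may help. It feeds three lines from the question through the loose regex and through the combined condition. Two things here are my assumptions: the regex is written with .* rather than .*? (POSIX extended regexes have no lazy quantifiers), and LC_ALL=C is set so that [A-Z] and [a-z] mean ASCII ranges. The variable names (input, loose, strict) are just for illustration.

```shell
# Three lines taken from the question's EC.txt.
input='(5)Yes.
(11) BOI.
4 not'

# Loose regex: one character class covers both cases, so all-uppercase text passes.
loose=$(printf '%s\n' "$input" | LC_ALL=C awk '/^\([0-9]+\).*[A-Za-z]+\.$/ {n++} END {print n+0}')

# Separate tests: the line must contain an uppercase AND a lowercase letter.
strict=$(printf '%s\n' "$input" | LC_ALL=C awk '/^\([0-9]+\)/ && /[A-Z]/ && /[a-z]/ && /\.$/ {n++} END {print n+0}')

echo "loose=$loose strict=$strict"
```

The loose version counts (11) BOI. as well as (5)Yes., while the combined condition only counts (5)Yes.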
Upvotes: 1
Reputation: 204731
idk if this is the expected output or not, since you didn't include that in your question, but I just coded what you said in your question ("count the number of lines begin with a decimal number in parenthesis, containing a mix of both upper and lower case letters and end with a period") and added the print so you can see what it matches. Take a look and see if it does what you want:
$ cat tst.awk
/^\([0-9]+\)/ && /[[:upper:]]/ && /[[:lower:]]/ && /\.$/ { print; cnt++ }
END { print cnt+0 }
$ awk -f tst.awk file
(1) Line one, this should count.
(2)Line two. Should also count.
(5)Yes.
(9)Oh ya? YOU SUCK.
(12) WoW MoM. Print mofo.
5
Don't get stuck thinking that the condition part of an awk statement has to be a regexp, like if this was sed or grep, as it doesn't - it can be a compound condition of ands/ors of regexp segments if that's what makes your code simpler and clearer as in this case IMHO.
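A compound condition also makes it easier to see why a given line is rejected, which may help with the debugging you mentioned. As a rough sketch (the variable name why and the reason labels are made up for illustration, and it assumes an awk that supports POSIX character classes), each test can be checked on its own and the failures reported:

```shell
# Three lines from the question's sample input.
input='(5)Yes.
(11) BOI.
7 OHHH mann'

report=$(printf '%s\n' "$input" | LC_ALL=C awk '
{
    why = ""                                   # collects the names of failed tests
    if ($0 !~ /^\([0-9]+\)/)  why = why " no-number"
    if ($0 !~ /[[:upper:]]/)  why = why " no-upper"
    if ($0 !~ /[[:lower:]]/)  why = why " no-lower"
    if ($0 !~ /\.$/)          why = why " no-period"
    if (why == "") cnt++                       # every test passed
    else print "skip:" why ": " $0
}
END { print cnt+0 }')

echo "$report"
```

Each rejected line is printed with the tests it failed, and the last line of output is the count of lines that passed all four.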
Upvotes: 3