Reputation: 123
I have a source UTF8 file (no BOM, windows EOL) that looks like this:
~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text

~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text
&&even_more_text_here

~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text

~someunicodetext_someunicodetext_someunicodetext~
So there are 3 types of lines (4 if you count blank lines). My goal is to count each non-blank type using a Python regex. This absolutely has to be a regex-based solution using Python 3.x, because I want to understand how it works.
My python script looks something like this:
import re, codecs
pattern = re.compile(r'some_expression_here')
count = 0
with codecs.open("some_input_file", "r", "UTF8") as inputFile:
    inputFile = inputFile.read()
lines = re.findall(pattern, inputFile)
for match in lines:
    count += 1
print(count)
The real problem I'm having is the regex itself.
~.*~ seems to match lines like 1, 4, 8 in my example above (counting from 1).
&&.* matches line 6.
But I can't figure out how to count the non-marked lines, which are lines 2, 5 and 9.
In Notepad++ the expression ^(?!(~.*~)|(&&.*)).* or simply ^(?!~|&).* works for me (even though it is not exactly correct), but all my attempts to replicate this in Python have failed...
Edit
inputFile.read() doesn't read the file the way I expect it to (hello, Windows EOL), which may or may not be important. Its output looks like this:
~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text
~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text
&&even_more_text_here
Upvotes: 1
Views: 2705
Reputation: 87064
You could try the pattern ^\w.* with the re.MULTILINE flag. The re.UNICODE flag should also be used for Python 2.
Here is a complete example:
import re, codecs
with codecs.open("input.txt", "r", "UTF8") as inputFile:
    data = inputFile.read()
pattern = re.compile(r'^\w.*', flags=re.MULTILINE)
lines = re.findall(pattern, data)
>>> data # note windows line termination
'~someunicodetext_someunicodetext_someunicodetext~\r\nsome_more_unicode_text_some_more_unicode_text\r\n \t\r\n~someunicodetext_someunicodetext_someunicodetext~\r\nsome_more_unicode_text_some_more_unicode_text\r\n&&even_more_text_here\r\n\r\n~someunicodetext_someunicodetext_someunicodetext~\r\nsome_more_unicode_text_some_more_unicode_text\r\n\r\n~someunicodetext_someunicodetext_someunicodetext~\r\n'
>>> print(lines)
['some_more_unicode_text_some_more_unicode_text\r', 'some_more_unicode_text_some_more_unicode_text\r', 'some_more_unicode_text_some_more_unicode_text\r']
>>> print(len(lines))
3
So the regex matches the "non-marked" non-blank lines as required.
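To see what the flag changes, here is a minimal sketch run against an in-memory string instead of a file. SAMPLE is a shortened, made-up stand-in for the data shown above (same line layout, same Windows line endings), so the names in it are assumptions and not the real file contents.

```python
import re

# Hypothetical stand-in for the file contents, with Windows line endings.
SAMPLE = (
    "~marked~\r\n"
    "plain_text_1\r\n"
    " \t\r\n"
    "~marked~\r\n"
    "plain_text_2\r\n"
    "&&other_mark\r\n"
    "\r\n"
    "~marked~\r\n"
    "plain_text_3\r\n"
    "\r\n"
    "~marked~\r\n"
)

# Without re.MULTILINE, ^ only matches at the very start of the string.
without_flag = re.findall(r'^\w.*', SAMPLE)

# With re.MULTILINE, ^ also matches right after every "\n".
with_flag = re.findall(r'^\w.*', SAMPLE, flags=re.MULTILINE)

# "." matches "\r", so strip it if the line text itself is needed.
cleaned = [line.rstrip('\r') for line in with_flag]

print(len(without_flag), len(with_flag), cleaned)
```

The count is the same with or without the trailing "\r", so the stripping step only matters if the matched text is used later.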
Upvotes: 0
Reputation: 67968
import re
x = "~someunicodetext_someunicodetext_someunicodetext~ \n \n \nsome_more_unicode_text_some_more_unicode_text \n"
pattern = re.compile(r"(\S+)")
print(len(pattern.findall(x)))
This counts runs of non-whitespace characters, so blank lines are not counted. Note that it equals the number of non-blank lines only as long as no line contains whitespace in the middle. Hope this helps.
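A small sketch of the difference between counting \S+ runs and counting non-blank lines, in case some lines contain spaces. The sample string and variable names are made up for illustration, and the second pattern is only one possible way to match a whole non-blank line.

```python
import re

# Hypothetical input: the first line contains spaces, the second is empty.
text = "first line has words\n\nsecond_line\n"

# \S+ counts whitespace-separated tokens, not lines.
token_count = len(re.findall(r'\S+', text))

# One match per line that contains at least one non-whitespace character.
non_blank_lines = re.findall(r'^.*\S.*$', text, flags=re.MULTILINE)

print(token_count, len(non_blank_lines))
```

The two numbers agree only when every non-blank line is a single token, which happens to be true for the sample in the question.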
Upvotes: 1
Reputation: 64288
The "non-marked" lines can be identified as the lines which aren't blank, do not start with ~ and do not start with &.
So the following regex would work:
^[^~&\s].*
Read it as: ^ = match at the beginning; [^...] = a single character which is not in the set; ~&\s = the character ~, the character & or a whitespace character (i.e. not one of those); .* = anything can come after that.
(I put in the \s just in case, because you said you're having problems with newlines. It also keeps blank lines from being counted, since a blank line starts with a whitespace character such as \r or \n.)
Also, it is much better to read the file line by line. You get:
import re, codecs
pattern = re.compile(r'^[^~&\s].*')
with codecs.open("some_input_file", "r", "UTF8") as inputFile:
    count = sum(1 for line in inputFile if pattern.search(line))
print(count)
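Since the question asks for a count of each line type, the same line-by-line idea can be extended with one pattern per type. This is only a sketch: the labels, the count_line_types helper and the sample lines are hypothetical, and the patterns are one reading of the three line types described in the question.

```python
import re
from collections import Counter

# Hypothetical labels; each pattern is tried in order and the first match wins.
LINE_TYPES = [
    ("tilde", re.compile(r'^~.*~\s*$')),    # starts and ends with ~
    ("ampersand", re.compile(r'^&&.*')),    # starts with &&
    ("plain", re.compile(r'^[^~&\s].*')),   # non-blank, not marked
]

def count_line_types(lines):
    """Count how many lines fall into each type; blank lines are skipped."""
    counts = Counter()
    for line in lines:
        for label, pattern in LINE_TYPES:
            if pattern.match(line):
                counts[label] += 1
                break
    return counts

# Made-up lines with Windows line endings, laid out like the sample file.
sample_lines = [
    "~a~\r\n", "text\r\n", " \t\r\n",
    "~b~\r\n", "text\r\n", "&&c\r\n", "\r\n",
    "~d~\r\n", "text\r\n", "\r\n",
    "~e~\r\n",
]
counts = count_line_types(sample_lines)
print(counts)
```

Any iterable of lines works here, so an open file object could be passed in place of sample_lines.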
Upvotes: 0
Reputation: 123
Here is the answer. I'm still not sure if I'm handling Windows EOL correctly, but this seems to work. Also, I kind of hoped someone would answer with an explanation of where my issue was and why it works the way it does, but oh well.
What this does: we match every line that has ~ plus EOL before it and ends with another EOL. At the same time we exclude matches that have 2 or more consecutive EOLs.
So this matches only the lines directly below the lines that are marked with ~.
import re, codecs
regex = re.compile(r'(?!~(\r\n){2,})~\r\n.*\r\n', re.MULTILINE)
count = 0
with codecs.open('input_file', 'r', 'UTF8') as inputFile:
    inputFile = inputFile.read()
lines = re.findall(regex, inputFile)
for match in lines:
    count += 1
print(count)
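As for why the Notepad++ expression did not carry over: a likely explanation is that in Python ^ only matches at line starts when re.MULTILINE is set, and that .* can match an empty string or a lone "\r", so blank lines produce matches too. The sketch below compares a direct port with a stricter variant. SAMPLE is a made-up stand-in for the file, and the stricter pattern is just one possible fix, not the only one.

```python
import re

# Hypothetical stand-in for the file contents, with Windows line endings.
SAMPLE = (
    "~marked~\r\n"
    "plain_text_1\r\n"
    " \t\r\n"
    "~marked~\r\n"
    "plain_text_2\r\n"
    "&&other_mark\r\n"
    "\r\n"
    "~marked~\r\n"
    "plain_text_3\r\n"
    "\r\n"
    "~marked~\r\n"
)

# Direct port of the Notepad++ pattern; only the MULTILINE flag is added.
# It also matches blank lines, because .* accepts "\r" or nothing at all.
naive = re.findall(r'^(?!~|&).*', SAMPLE, flags=re.MULTILINE)

# Stricter variant: the first character must be non-whitespace,
# and the match stops before the "\r\n" line ending.
strict = re.findall(r'^(?!~|&)\S[^\r\n]*', SAMPLE, flags=re.MULTILINE)

print(len(naive), len(strict), strict)
```

Because the stricter pattern anchors on line starts instead of on the preceding ~ line, it would also count an unmarked line that follows a && line.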
Upvotes: 0