Reputation: 38
I am new to Python's re module and am trying to read from a file and count the words. But regardless of the pattern I give, it adds an empty string to the list of words when it reaches the end of the line.
The input file I am reading uses CRLF line endings.
words = re.split(r'[~\r\n]+|\.\s*|;\s*|,\s*|\s*|\.|\r\n|$', line)
The following is the input line and the corresponding output.
This is a test line; to verify, the regex pattern used.
['This', 'is', 'a', 'test', 'line', 'to', 'verify', 'the', 'regex', 'pattern', 'used', '']
Upvotes: 0
Views: 270
Reputation: 3520
What about:
re.split(r'\W(?!\Z)', line)
Output:
['This', 'is', 'a', 'test', 'line', '', 'to', 'verify', '', 'the', 'regex', 'pattern', 'used.']
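It's not perfect (the period stays attached to the word 'used', and adjacent delimiters such as '; ' leave empty strings in the list), but once the empty strings are ignored it'll do the job for counting words.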
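To illustrate where the empty strings come from, here is a minimal sketch. It assumes the line still ends in \r\n and uses a simplified character class of my own rather than the pattern from the question. Whenever a delimiter matches at the very end of the string, re.split returns whatever follows that match, which is an empty string; filtering it out before counting is one way around it.

```python
import re

# Assumption: the line is read with its CRLF terminator still attached.
line = "This is a test line; to verify, the regex pattern used.\r\n"

# Simplified delimiter set (illustrative, not the asker's pattern):
# one or more of ';', ',', '.', or whitespace.
parts = re.split(r"[;,.\s]+", line)

# The final delimiter run ".\r\n" sits at the end of the string, so the
# last item of parts is the empty string that follows it.
words = [p for p in parts if p]  # drop empty strings before counting

print(parts)
print(len(words), words)
```

The list comprehension keeps only non-empty items, so the count reflects actual words regardless of how many delimiters sit at the end of the line.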
EDIT
To be honest, you should just use whitespace as the delimiter and nothing else. My answer and @CSMaverick's answer don't work on, for example,
hello-world I am
because the hyphenated word gets split in two. To handle all the different cases, the regex would become quite dirty. I recommend you use something as simple as re.split(r'\s', line).
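One caveat with splitting on a single \s: if the line still carries its CRLF terminator, the \r and \n are each treated as a delimiter, so empty strings come back at the end. A minimal sketch, assuming the trailing \r\n is present, that compares this with the built-in str.split() called without arguments (which splits on runs of whitespace and drops leading and trailing empties):

```python
import re

# Assumption: the line is read with its CRLF terminator still attached.
line = "hello-world I am\r\n"

# Each single whitespace character is its own delimiter, so "\r" and "\n"
# produce two trailing empty strings.
with_regex = re.split(r"\s", line)

# str.split() with no arguments treats any run of whitespace as one
# delimiter and never returns empty strings.
with_str_split = line.split()

print(with_regex)
print(with_str_split)
```

For plain whitespace-delimited counting, len(line.split()) is probably all that is needed.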
Upvotes: 0
Reputation: 2129
You could do something like this.
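import re

line = "This is a test line; to verify, the regex pattern used."
# Match a word with optional inner apostrophes (e.g. "don't"), or a single word character.
regx = re.compile(r"(\w[\w']*\w|\w)")
regx.findall(line)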
#output
['This',
'is',
'a',
'test',
'line',
'to',
'verify',
'the',
'regex',
'pattern',
'used']
Hope it helps!
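Since the goal is to count words across a file, the findall approach can be combined with collections.Counter. This is only a sketch: count_words and the sample lines are hypothetical names and data for illustration, and in practice the lines would come from iterating over the opened input file.

```python
import re
from collections import Counter

# Same word pattern as above, without the capture group.
WORD_RE = re.compile(r"\w[\w']*\w|\w")

def count_words(lines):
    """Return a Counter of words found in an iterable of lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # findall only returns matches, so CRLF and punctuation never
        # produce empty strings.
        counts.update(WORD_RE.findall(line))
    return counts

# Stand-in for the lines of the input file (illustrative data).
sample = [
    "This is a test line; to verify, the regex pattern used.\r\n",
    "This line is a second test.\r\n",
]

counts = count_words(sample)
total = sum(counts.values())
print(total)
print(counts.most_common(3))
```

Because findall collects matches instead of splitting on delimiters, the line terminator never needs special handling.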
Upvotes: 1