Manasa Devadas
Reputation: 38

Python re.split adds an empty string at end of line (CRLF) when reading input from a file

I am new to the Python re module and am trying to read from a file and count the words. But regardless of the pattern I use, it adds an empty string to the list of words when it reaches the end of the line.

The input file uses CRLF line endings.

words  = re.split(r'[~\r\n]+|\.\s*|;\s*|,\s*|\s*|\.|\r\n|$', line)

The following is the input line and the corresponding output:

This is a test line; to verify, the regex pattern used.

['This', 'is', 'a', 'test', 'line', 'to', 'verify', 'the', 'regex', 'pattern', 'used', '']

Upvotes: 0

Views: 270

Answers (2)

olfek
Reputation: 3520

What about:

re.split(r'\W(?!\Z)', line)

Output:

['This', 'is', 'a', 'test', 'line', '', 'to', 'verify', '', 'the', 'regex', 'pattern', 'used.']

It's not perfect (the period stays attached to the word 'used', and empty strings appear where two delimiters are adjacent), but it'll do the job for counting words.
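The empty strings in the output above come from adjacent delimiters such as "; " and ", ". As a rough sketch of one way around that, assuming it is acceptable to split on runs of non-word characters and to filter out whatever empty strings remain (this is a variation for illustration, not the pattern given above):

```python
import re

# Assumed sample: the line from the question with its CRLF terminator.
line = "This is a test line; to verify, the regex pattern used.\r\n"

# \W+ treats "; " or ".\r\n" as one delimiter; the filter drops the
# empty string left after the final delimiter.
words = [w for w in re.split(r'\W+', line) if w]

print(words)
print(len(words))
```

Like the pattern above, this still splits hyphenated input such as hello-world into two words.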

EDIT

To be honest, you should just use whitespace as the delimiter and nothing else. Neither my answer nor @CSMaverick's works on input such as hello-world I am. Handling all the different cases would make the regex quite messy, so I recommend something as simple as re.split(r'\s', line).
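A minimal sketch of the whitespace-only approach, assuming the line still ends in CRLF as in the question. The line.strip() call and str.split() are additions for illustration rather than part of the recommendation itself; they show one way to avoid the empty strings that a trailing "\r\n" otherwise leaves behind.

```python
import re

# Assumed sample: the line from the question with its CRLF terminator.
line = "This is a test line; to verify, the regex pattern used.\r\n"

# Splitting on single whitespace characters leaves empty strings
# where the trailing CR and LF were.
with_empties = re.split(r'\s', line)

# Stripping first and splitting on runs of whitespace avoids them.
stripped = re.split(r'\s+', line.strip())

# str.split() with no arguments does the same without a regex.
plain = line.split()

print(with_empties)
print(stripped)
print(plain)
```

Punctuation stays attached to the words here ('line;', 'used.'), which does not matter for a plain word count but would matter if the words themselves are compared.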

Upvotes: 0

Chandu
Reputation: 2129

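You could match the words directly with findall instead of splitting on delimiters:

import re

line = "This is a test line; to verify, the regex pattern used."
# Raw string so that \w reaches the regex engine unchanged.
# A word is a run of word characters that may contain apostrophes inside it.
regx = re.compile(r"(\w[\w']*\w|\w)")
regx.findall(line)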

#output 
['This',
 'is',
 'a',
 'test',
 'line',
 'to',
 'verify',
 'the',
 'regex',
 'pattern',
 'used']
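To tie this back to the question, here is a small sketch of counting words over several CRLF-terminated lines with the same pattern. The count_words helper and the sample lines are hypothetical, made up for illustration; a list stands in for the lines read from the file.

```python
import re
from collections import Counter

# Same pattern as above, written as a raw string.
WORD_RE = re.compile(r"\w[\w']*\w|\w")

def count_words(lines):
    """Count words across an iterable of lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # findall returns only the matched words, so the trailing
        # CRLF never shows up as an empty string.
        counts.update(WORD_RE.findall(line))
    return counts

# Assumed sample input standing in for the lines of the file.
sample = [
    "This is a test line; to verify, the regex pattern used.\r\n",
    "This is a second line.\r\n",
]
counts = count_words(sample)
print(sum(counts.values()))  # total number of words
print(counts["This"])        # occurrences of one word
```

Because findall collects matches rather than the text between delimiters, no filtering of empty strings is needed.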

Hope it helps!

Upvotes: 1
