S Dub

Reputation: 63

Why isn't this regex parsing the whole string?

I'm writing a simple script to parse a large text file into words, their parent sentences, and some metadata (whether they fall within a quote, etc.). I'm trying to get the regex to work properly and have run into a strange issue. Here's a small bit of test code showing what's going on with my parsing. The whitespace in the output is intentional, but I can't understand why the last 'word' is not being split. It is not preceded by any problematic characters (at least as far as I can tell using repr), and when I run parse() on just the problem 'word', it returns the expected array of single words and spaces.

Code:

def parse(new_line):
    new_line = new_line.rstrip()
    word_array = re.split('([\.\?\!\ ])',new_line,re.M)
    print(word_array)

x = full_text.readline()
print(repr(x))
parse(x)

Output:

'Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy\n'

['Far', ' ', 'out', ' ', 'in', ' ', 'the', ' ', 'uncharted', ' ', 'backwaters', ' ', 'of', ' ', 'the', ' ', 'unfashionable end of the western spiral arm of the Galaxy']

Upvotes: 2

Views: 72

Answers (1)

Tim Peters

Reputation: 70602

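re.M is 8, and you're passing it as the third positional argument, which is maxsplit, not flags: the signature is re.split(pattern, string, maxsplit=0, flags=0). So the call stops after 8 splits, and everything after the eighth space comes back as one string. You want flags=re.M instead.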
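
To make the difference concrete, here is a minimal sketch using the line from the question. The names `pattern`, `limited`, and `full` are illustrative and not from the original post, and the character class is written without the redundant backslashes, which should match the same four characters.

```python
import re

line = ("Far out in the uncharted backwaters of the unfashionable end "
        "of the western spiral arm of the Galaxy")

# Same delimiters as in the question: '.', '?', '!' and space.
pattern = r'([.?! ])'

# What the original call amounted to: re.M has the integer value 8,
# and in the third positional slot it is read as maxsplit.
limited = re.split(pattern, line, maxsplit=int(re.M))

# Passing the flag by keyword leaves maxsplit at its default of 0 (no limit).
full = re.split(pattern, line, flags=re.M)

print(limited[-1])  # the unsplit remainder after the 8th split
print(full[-1])     # the actual last word
```

Since the pattern contains no `^` or `$` anchors, re.M has no effect on this particular split, so dropping the argument altogether would likely behave the same as `flags=re.M` here.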

Upvotes: 4
