Reputation: 63
I'm writing a simple script to parse a large text file into words, their parent sentences, and some metadata (whether they are within a quote, etc.). I'm trying to get the regex to work properly and have run into a strange issue. Here's a small piece of test code showing what's going on with my parsing. The whitespace is intentional, but I can't understand why the last 'word' is not being split. It is not preceded by any problematic characters (at least as far as I can tell using repr), and when I run parse() on just the problem 'word', it returns the expected array of single words and spaces.
Code:
import re

def parse(new_line):
    new_line = new_line.rstrip()
    word_array = re.split('([\.\?\!\ ])',new_line,re.M)
    print(word_array)

# full_text is an already-open file object (not shown here)
x = full_text.readline()
print(repr(x))
parse(x)
Output:
'Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy\n'
['Far', ' ', 'out', ' ', 'in', ' ', 'the', ' ', 'uncharted', ' ', 'backwaters', ' ', 'of', ' ', 'the', ' ', 'unfashionable end of the western spiral arm of the Galaxy']
Upvotes: 2
Views: 72
Reputation: 70602
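re.M is 8, and you're passing that as the maxsplit positional argument, so the split stops after eight matches. The signature is re.split(pattern, string, maxsplit=0, flags=0). You want flags=re.M instead.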
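A minimal sketch of the fix, assuming the same delimiters as in the question. The sample line is copied from the question's output, and parse() is changed to return the list rather than print it so the result can be inspected. The last two lines reproduce the original symptom by passing the flag's integer value as maxsplit.

```python
import re

PATTERN = r'([.?! ])'  # same delimiters as the question; no escaping needed inside [...]

def parse(new_line):
    """Split a line into words, keeping each delimiter as its own element."""
    new_line = new_line.rstrip()
    # Flags must be passed by keyword: the third positional argument is maxsplit.
    return re.split(PATTERN, new_line, flags=re.M)

line = ('Far out in the uncharted backwaters of the unfashionable end '
        'of the western spiral arm of the Galaxy\n')

words = parse(line)
print(words)

# Reproducing the original behaviour: re.M has the integer value 8,
# so using it as maxsplit stops the split after eight delimiters.
truncated = re.split(PATTERN, line.rstrip(), maxsplit=int(re.M))
print(truncated[-1])
```

Note that re.M only changes how ^ and $ match, so with this pattern the flag could probably be dropped altogether.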
Upvotes: 4