Reputation: 392
I am trying to split text into sentences while preserving dialogue markers. For example, the input
“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.” This is a non-dialogue sentence!
should return the list
[
"“Dirty, Mr. Jones?”",
"“Look at my shoes!”",
"“Not a speck on them.”",
"This is a non-dialogue sentence!"
]
I’m struggling to keep the end-of-sentence punctuation while also keeping the period in Mr. intact. I am also struggling with inserting the quotation marks, as the returned list is currently
['“Dirty, Mr. Jones”', '“Look at my shoes”', '“Not a speck on them”', '“”', 'This is a non-dialogue sentence', '']
and I don’t know why I’m getting the two empty elements. How can I fix these problems?
Here is my code (eventually this will parse the whole book but for now I’m testing it on one phrase):
import re

def get_all_sentences(corpus):
    sentences_in_paragraph = []
    dialogue = False
    dialogue_sentences = ""
    other_sentences = ""
    example_paragraph = "“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.” This is a non-dialogue sentence!"
    example_paragraph = example_paragraph.replace("\n", "")  # remove newline
    for character in example_paragraph:
        if character == "“":
            dialogue = True
            continue
        if character == "”":
            dialogue = False
            continue
        if dialogue:
            dialogue_sentences += character
        else:
            other_sentences += character
    sentences_in_paragraph = list(map(lambda x: "“" + x.strip() + "”", re.split("(?<!Mr|Ms)(?<!Mrs)[.!?]", dialogue_sentences)))
    sentences_in_paragraph += list(map(lambda x: x.strip(), re.split("(?<!Mr|Ms)(?<!Mrs)[.!?]", other_sentences)))
    print(sentences_in_paragraph)
Upvotes: 1
Views: 85
Reputation: 30428
If you add print statements to show the intermediate steps, you can see where the problem is introduced:
sentence_splitter_regex = "(?<!Mr|Ms)(?<!Mrs)[.!?]"
dialogue_sentences_list = re.split(sentence_splitter_regex, dialogue_sentences)
print("dialogue sentences:", dialogue_sentences_list)
other_sentences_list = re.split(sentence_splitter_regex, other_sentences)
print("other sentences:", other_sentences_list)
sentences_in_paragraph = list(map(lambda x: "“" + x.strip() + "”", dialogue_sentences_list))
sentences_in_paragraph += list(map(lambda x: x.strip(), other_sentences_list))
This prints:
dialogue sentences: ['Dirty, Mr. Jones', ' Look at my shoes', ' Not a speck on them', '']
other sentences: [' This is a non-dialogue sentence', '']
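The trailing empty string can be reproduced in isolation. The following is a minimal sketch (the sample string "One. Two!" is made up for illustration): when the text ends with a delimiter, re.split still returns the empty text that follows it, and wrapping that empty piece in quotation marks is what produces the '“”' element.

```python
import re

# The text ends with a delimiter, so the final piece is the empty string after "!"
pieces = re.split("[.!?]", "One. Two!")
print(pieces)  # ['One', ' Two', '']

# Wrapping every piece in quotation marks turns the empty piece into '“”'
wrapped = ["“" + piece.strip() + "”" for piece in pieces]
print(wrapped)  # ['“One”', '“Two”', '“”']
```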
re.split is leaving an empty element at the end of each list. You can fix this by processing the result with a list comprehension whose if clause filters out empty strings:
[sentence for sentence in sentences_with_whitespace if sentence.strip() != '']
You should put this code inside a new function, split_sentences_into_list, to keep your code organized. It also makes sense to move the .strip() processing from get_all_sentences into this function, by changing the first part of the list comprehension to sentence.strip().
import re

def split_sentences_into_list(sentences_string):
    sentence_splitter_regex = "(?<!Mr|Ms)(?<!Mrs)[.!?]"
    sentences_with_whitespace = re.split(sentence_splitter_regex, sentences_string)
    return [sentence.strip() for sentence in sentences_with_whitespace if sentence.strip() != '']

def get_all_sentences(corpus):
    sentences_in_paragraph = []
    dialogue = False
    dialogue_sentences = ""
    other_sentences = ""
    example_paragraph = "“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.” This is a non-dialogue sentence!"
    example_paragraph = example_paragraph.replace("\n", "")  # remove newline
    for character in example_paragraph:
        if character == "“":
            dialogue = True
            continue
        if character == "”":
            dialogue = False
            continue
        if dialogue:
            dialogue_sentences += character
        else:
            other_sentences += character
    dialogue_sentences_list = split_sentences_into_list(dialogue_sentences)
    other_sentences_list = split_sentences_into_list(other_sentences)
    sentences_in_paragraph = list(map(lambda x: "“" + x + "”", dialogue_sentences_list))
    sentences_in_paragraph += other_sentences_list
    print(sentences_in_paragraph)

get_all_sentences(None)
This prints the list without the empty elements, although the sentence-ending punctuation is still dropped:
['“Dirty, Mr. Jones”', '“Look at my shoes”', '“Not a speck on them”', 'This is a non-dialogue sentence']
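The question also asks to keep the ?, ! and . at the end of each sentence. re.split discards whatever the pattern matches, so one possible approach is to split on the whitespace that follows the punctuation rather than on the punctuation itself. The sketch below is an illustration under stated assumptions, not a complete sentence tokenizer: the names SENTENCE_BOUNDARY, split_keeping_punctuation and get_sentences are hypothetical, only Mr., Ms. and Mrs. are treated as abbreviations, and quotation marks are assumed to be balanced curly quotes.

```python
import re

# Whitespace preceded by ., ! or ?, unless that period belongs to Mr./Ms./Mrs.
# Each lookbehind has a fixed width, which Python's re module requires.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])(?<!Mr\.)(?<!Ms\.)(?<!Mrs\.)\s+")

def split_keeping_punctuation(text):
    # Only the whitespace is consumed by the split, so the punctuation stays attached
    parts = SENTENCE_BOUNDARY.split(text.strip())
    return [part for part in parts if part]

def get_sentences(paragraph):
    sentences = []
    # Walk through the paragraph as alternating quoted and unquoted runs
    for match in re.finditer(r"“([^”]*)”|([^“]+)", paragraph):
        quoted, plain = match.group(1), match.group(2)
        if quoted is not None:
            # Re-wrap every sentence of the dialogue in its own quotation marks
            sentences += ["“" + s + "”" for s in split_keeping_punctuation(quoted)]
        else:
            sentences += split_keeping_punctuation(plain)
    return sentences

paragraph = "“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.” This is a non-dialogue sentence!"
print(get_sentences(paragraph))
# ['“Dirty, Mr. Jones?”', '“Look at my shoes!”', '“Not a speck on them.”', 'This is a non-dialogue sentence!']
```

Unlike the character loop above, this version keeps dialogue and non-dialogue sentences in the order they appear in the paragraph, which may or may not be what you want when parsing a whole book.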
By the way, standard Python style is to use list comprehensions instead of map and lambda when possible. It would make your code shorter in this case:
# from
sentences_in_paragraph = list(map(lambda x: "“" + x + "”", dialogue_sentences_list))
# to
sentences_in_paragraph = ["“" + x + "”" for x in dialogue_sentences_list]
Upvotes: 2