Splitting sentences from prose

Question

I am trying to split on sentences, and also preserve dialogue markers. So a sentence like

“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.” This is a non-dialogue sentence!

should return the list

[
    "“Dirty, Mr. Jones?”",
    "“Look at my shoes!”",
    "“Not a speck on them.”",
    "This is a non-dialogue sentence!"
]

I’m struggling to keep the end-of-sentence punctuation while also keeping the period in “Mr.”. I am also struggling with inserting the quotation marks: the list currently returned is ['“Dirty, Mr. Jones”', '“Look at my shoes”', '“Not a speck on them”', '“”', 'This is a non-dialogue sentence', ''], and I don’t know why I’m getting the two empty elements. How can I fix these problems?

Here is my code (eventually this will parse the whole book but for now I’m testing it on one phrase):

def get_all_sentences(corpus):

  sentences_in_paragraph = []

  dialogue = False
  dialogue_sentences = ""
  other_sentences = ""

  example_paragraph = "“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.”  This is a non-dialogue sentence!"

  example_paragraph = example_paragraph.replace("\n", "") # remove newline

  for character in example_paragraph:
    if character == "“":
        dialogue = True
        continue
    if character == "”":
        dialogue = False
        continue

    if dialogue:
        dialogue_sentences += character
    else:
        other_sentences += character

  sentences_in_paragraph  = list(map(lambda x: "“" + x.strip() + "”", re.split("(?

Rory O'Kane · Accepted Answer

If you add print statements to show the intermediate steps, you can see where the problem is introduced:

sentence_splitter_regex = "(?



dialogue sentences ['Dirty, Mr. Jones', ' Look at my shoes', ' Not a speck on them', '']
other sentences ['    This is a non-dialogue sentence', '']


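The re.split is leaving an empty element at the end, because each string ends with a delimiter and re.split returns whatever follows the last delimiter, even when that is an empty string. You can fix this by filtering the result with a list comprehension whose if clause drops empty strings: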

[sentence for sentence in sentences_with_whitespace if sentence.strip() != '']
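
As a minimal, self-contained sketch of that behaviour: the pattern below is a simple stand-in chosen for illustration (an assumption, not the regex from the question, and it does not handle "Mr."). It only shows where the trailing empty string comes from and how the filter removes it.

```python
import re

# Stand-in pattern for illustration only (an assumption; not the
# regex used in the question, and it does not special-case "Mr.").
text = "Look at my shoes! Not a speck on them."

# The text ends with a delimiter, so re.split returns one extra,
# empty element after the final ".".
sentences_with_whitespace = re.split(r"[.?!]", text)
print(sentences_with_whitespace)
# ['Look at my shoes', ' Not a speck on them', '']

# Strip each piece and drop the ones that are empty after stripping.
sentences = [s.strip() for s in sentences_with_whitespace if s.strip() != ""]
print(sentences)
# ['Look at my shoes', 'Not a speck on them']
```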


You should put this code inside a new function, split_sentences_into_list, to keep your code organized. It also makes sense to move the .strip() processing from get_all_sentences into this function, by changing the first part of the list comprehension to sentence.strip().

import re

def split_sentences_into_list(sentences_string):
    sentence_splitter_regex = "(?


This has the expected output:

['“Dirty, Mr. Jones”', '“Look at my shoes”', '“Not a speck on them”', 'This is a non-dialogue sentence']
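
The output above still drops the end-of-sentence punctuation that the question wanted to keep. One possible way to retain it, sketched here under stated assumptions, is to split on the whitespace that follows the punctuation rather than on the punctuation itself. The function name split_keeping_punctuation and the pattern are hypothetical, and the only abbreviation handled is "Mr.".

```python
import re

# Hypothetical pattern (an assumption, not the regex from the question):
# split on whitespace that follows ".", "?" or "!", unless the three
# characters just before that whitespace are "Mr.".
SENTENCE_BOUNDARY = re.compile(r"(?<!Mr\.)(?<=[.?!])\s+")

def split_keeping_punctuation(text):
    """Split text into sentences, keeping each sentence's final punctuation."""
    pieces = SENTENCE_BOUNDARY.split(text.strip())
    # Guard against empty pieces, e.g. when the input is an empty string.
    return [piece for piece in pieces if piece]

dialogue_sentences = "Dirty, Mr. Jones? Look at my shoes! Not a speck on them."
other_sentences = "  This is a non-dialogue sentence!"

sentences_in_paragraph = (
    ["“" + s + "”" for s in split_keeping_punctuation(dialogue_sentences)]
    + split_keeping_punctuation(other_sentences)
)
print(sentences_in_paragraph)
```

Because the split consumes only the whitespace, the punctuation stays attached to the sentence before it. Other abbreviations (for example "Mrs." or "Dr.") would need their own lookbehinds, since Python lookbehinds must have a fixed width.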


By the way, standard Python style is to use list comprehensions instead of map and lambda when possible. It would make your code shorter in this case:

# from
sentences_in_paragraph  = list(map(lambda x: "“" + x + "”", dialogue_sentences_list)) 
# to
sentences_in_paragraph  = ["“" + x + "”" for x in dialogue_sentences_list]
