Reputation: 181
I have text scraped from PDFs, parsed into a list of strings. I would like to join sentences that were returned as separate strings because of line breaks on the PDF page. For example:
list = ['I am a ', 'sentence.', 'Please join me toge-', 'ther. Thanks for your help.']
I would like to have:
list = ['I am a sentence.', 'Please join me together. Thanks for your help.']
I currently have the following code, which joins some sentences, but the second fragment that was joined to the first is still returned as well. I am aware this is due to indexing but am not sure how to fix the issue.
new = []
def cleanlist(dictlist):
    for i in range(len(dictlist)):
        if i>0:
            if dictlist[i-1][-1:] != ('.') or dictlist[i-1][-1:] != ('. '):
                new.append(dictlist[i-1]+dictlist[i])
            elif dictlist[i-1][-1:] == '-':
                new.append(dictlist[i-1]+dictlist[i])
            else:
                new.append[dict_list[i]]
Upvotes: 1
Views: 159
Reputation: 7206
You could use a generator approach:
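def cleanlist(dictlist):
    current = []
    for line in dictlist:
        if line.endswith("-"):
            # Hyphenated line break: drop the hyphen and keep collecting.
            current.append(line[:-1])
        elif line.endswith(" "):
            # Trailing space: the sentence continues in the next string.
            current.append(line)
        else:
            # Anything else closes the current group.
            current.append(line)
            yield "".join(current)
            current = []
    # Emit any leftover fragment if the input ends mid-sentence.
    if current:
        yield "".join(current)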
Use it like this:
dictlist = ['I am a ', 'sentence.', 'Please join me toge-', 'ther. Thanks for your help.']
print(list(cleanlist(dictlist)))
# ['I am a sentence.', 'Please join me together. Thanks for your help.']
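As for why the loop in the question joins every pair: the first condition appears to be true for any input, so the `elif` and `else` branches are never reached. Here is a minimal sketch of that check, assuming the condition is copied as written in the question; `question_condition` and `samples` are names made up for this illustration.

```python
# Illustrative only: reproduces the first condition from the question's loop.
def question_condition(prev):
    last = prev[-1:]  # a slice of at most one character
    # A one-character string can never equal '. ', and no string can equal
    # both '.' and '. ', so at least one side of the `or` is always True.
    return last != '.' or last != '. '

samples = ['I am a ', 'sentence.', 'Please join me toge-', '']
results = [question_condition(s) for s in samples]
print(results)  # [True, True, True, True]
```

Because the check never fails, every string is appended together with its predecessor, which is why fragments show up more than once. Testing with `str.endswith`, as in the generator above, avoids the problem.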
Upvotes: 1