JSC

Reputation: 181

Join together sentences from parsed pdf

I have scraped text from PDFs and parsed it into a list of strings. I would like to join sentences that were returned as separate strings because of line breaks on the PDF page. For example,

list = ['I am a ', 'sentence.', 'Please join me toge-', 'ther. Thanks for your help.'] 

I would like to have:

list = ['I am a sentence.', 'Please join me together. Thanks for your help.'] 

I currently have the following code, which joins some sentences, but the second fragment that was joined to the first is still returned. I am aware this is due to indexing, but I am not sure how to fix it.

new = []

def cleanlist(dictlist):
    for i in range(len(dictlist)):
        if i > 0:
            if dictlist[i-1][-1:] != ('.') or dictlist[i-1][-1:] != ('. '):
                new.append(dictlist[i-1] + dictlist[i])
            elif dictlist[i-1][-1:] == '-':
                new.append(dictlist[i-1] + dictlist[i])
            else:
                new.append(dictlist[i])

Upvotes: 1

Views: 159

Answers (1)

Graipher

Reputation: 7206

You could use a generator approach:

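def cleanlist(dictlist):
    current = []
    for line in dictlist:
        if line.endswith("-"):
            # Hyphenated line break: drop the hyphen and keep collecting.
            current.append(line[:-1])
        elif line.endswith(" "):
            # Trailing space: the sentence continues on the next line.
            current.append(line)
        else:
            # Anything else is treated as the end of a joined string.
            current.append(line)
            yield "".join(current)
            current = []
    if current:
        # Emit any leftover fragments if the input ends mid-sentence.
        yield "".join(current)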

Use it like this:

dictlist = ['I am a ', 'sentence.', 'Please join me toge-', 'ther. Thanks for your help.']
print(list(cleanlist(dictlist)))
# ['I am a sentence.', 'Please join me together. Thanks for your help.']
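
The generator above treats any line that does not end in a hyphen or a space as the end of a sentence. If your parsed fragments can break mid-sentence without a trailing space, a variant that waits for sentence-final punctuation may work better. This is only a sketch: the name `join_fragments` and the `SENTENCE_END` tuple are my own choices, and it assumes that a trailing hyphen always marks a split word and that `.`, `!` and `?` always end a sentence (abbreviations like "e.g." would break that assumption).

```python
# Hypothetical set of sentence-final characters; adjust for your text.
SENTENCE_END = (".", "!", "?")


def join_fragments(fragments):
    """Yield fragments merged until one ends in sentence-final punctuation."""
    current = []
    for fragment in fragments:
        if fragment.endswith("-"):
            # Assumption: a trailing hyphen marks a word split across lines.
            current.append(fragment[:-1])
            continue
        current.append(fragment)
        # Ignore trailing whitespace when checking for the end of a sentence.
        if fragment.rstrip().endswith(SENTENCE_END):
            yield "".join(current)
            current = []
    if current:
        # Emit whatever is left if the input stops mid-sentence.
        yield "".join(current)


fragments = ['I am a ', 'sentence.', 'Please join me toge-',
             'ther. Thanks for', ' your help.', 'Unfinished']
print(list(join_fragments(fragments)))
# ['I am a sentence.', 'Please join me together. Thanks for your help.', 'Unfinished']
```

Note that fragments are concatenated as they are, so a fragment that breaks between words still needs its own leading or trailing space.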

Upvotes: 1
