Ian
Ian

Reputation: 160

how to restore a splitted word by removing the hyphen "-" because of hyphenation in a paragraph using python

simple example: func-tional --> functional

The story is that I got a Microsoft Word document, which is converted from PDF format, and some words remain hyphenated (such as func-tional, broken because of line break in PDF). I want to recover those broken words while normal ones(i.e., "-" is not for word-break) are kept.

In order to make it more clear, one long example (source text) is added:

After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.

Could someone give me some suggestions on this problem?

Upvotes: 2

Views: 1176

Answers (1)

Sharku
Sharku

Reputation: 1092

I would use regular expression. This little script searches for words with hyphenated and replaces the hyphenated by nothing.

import re


def replaceHyphenated(s):
    matchList = re.findall(r"\w+-\w+",s) # find combination of word-word 
    sOut = s
    for m in matchList:
        new = m.replace("-","")
        sOut = sOut.replace(m,new)
    return sOut



if __name__ == "__main__":

    s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""    
    print(replaceHyphenated(s))

output would be:

After the symposium, the Foundation and the FCF steering team continued their work and created the Functional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendium, contact your manufacturer for further guidance.

If you are not used to RegExp I recommend this site: https://regex101.com/

Upvotes: 2

Related Questions