CodeHard

Reputation: 125

Python: How do I merge hyphenated words with newlines?

I want to say that Napp Granade
serves in the spirit of a town in our dis-
trict of Georgia called Andersonville.

I have thousands of text files with data like the above, in which words have been wrapped using hyphens and newlines.

What I am trying to do is remove the hyphen and move the newline to the end of the rejoined word. If possible, I do not want to touch every hyphenated word, only those split at the end of a line.

    with open(filename, encoding="utf8") as f:
        file_str = f.read()

    re.sub("\s*-\s*", "", file_str)

    with open(filename, "w", encoding="utf8") as f:
        f.write(file_str)

The above code is not working, and I have tried several different approaches.

I want to go through the entire text file and remove every hyphen that marks a line break, so that the result looks like this:

I want to say that Napp Granade
serves in the spirit of a town in our district
of Georgia called Andersonville.

Any help would be appreciated.

Upvotes: 4

Views: 1521

Answers (1)

Thierry Lathuille

Reputation: 24232

You don't need to use a regex:

filename = 'test.txt'

# Contents of test.txt before:
# I want to say that Napp Granade
# serves in the spirit of a town in our dis-
# trict of Georgia called Anderson-
# ville.

with open(filename, encoding="utf8") as f:
    lines = [line.rstrip('\n') for line in f]

for num, line in enumerate(lines):
    # only join when there is a non-blank next line to take the word end from
    if line.endswith('-') and num + 1 < len(lines) and lines[num+1].strip():
        # the end of the word is at the start of the next line
        end = lines[num+1].split()[0]
        # we remove the - and append the end of the word
        lines[num] = line[:-1] + end
        # and remove the end of the word and possibly the
        # following space from the next line
        lines[num+1] = lines[num+1][len(end)+1:]

text = '\n'.join(lines)

with open(filename, "w", encoding="utf8") as f:
    f.write(text)

# Contents of test.txt after:
# I want to say that Napp Granade
# serves in the spirit of a town in our district
# of Georgia called Andersonville.

But you can, of course, and it's shorter:

import re

with open(filename, encoding="utf8") as f:
    text = f.read()

# re.sub returns a new string, so the result must be assigned
text = re.sub(r'-\n(\w+ *)', r'\1\n', text)

with open(filename, "w", encoding="utf8") as f:
    f.write(text)

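The pattern matches a - followed by \n and captures the following word (plus any trailing spaces), which is the end of the split word.
The whole match is then replaced by the captured text followed by a newline, so the line break moves to just after the rejoined word.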

Don't forget to use raw strings for the replacement, in order for \1 to be interpreted correctly.
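
To try the substitution without touching any files, here is a minimal sketch that runs it on an in-memory string. The function name join_hyphenated is made up for illustration, and the pattern is a slight variant of the one above: it uses \S+ instead of \w+ so that trailing punctuation (as in "ville.") stays attached to the word, and it keeps the trailing spaces outside the group so they are dropped rather than left before the newline. Treat both changes as assumptions about what you want, not as the only correct choice.

```python
import re

def join_hyphenated(text: str) -> str:
    """Join words split by '-' at a line end and move the line break
    to just after the rejoined word (hypothetical helper)."""
    # Raw strings: the regex engine sees \n as a newline and \1 as group 1.
    return re.sub(r'-\n(\S+) *', r'\1\n', text)

sample = (
    "I want to say that Napp Granade\n"
    "serves in the spirit of a town in our dis-\n"
    "trict of Georgia called Andersonville.\n"
)
print(join_hyphenated(sample))

# A hyphen that is not at the end of a line is left alone.
print(join_hyphenated("a well-known town"))

# Without the r prefix, '\1' is the single control character chr(1),
# not a reference to the captured group.
print('\1' == '\x01')
```

The last line prints True, which is why a non-raw replacement string would insert a stray control character instead of the captured word.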

Upvotes: 6
