futuredataengineer

Reputation: 462

Correcting words broken into syllables in a text

I converted a .pdf file into .txt using Python. It is fairly easy to "clean" the text by removing special characters or other characters that I don't want. However, I have a problem that I haven't managed to solve other than manually.

The text is in German and some words are broken into syllables at the end of a line (they were probably like that in the original .pdf). So I have text like:

Das ist die Belastung eines Grundstücks mit der Haftung für bestimmte, in der Regel wiederkeh-
rende Leistungen des jeweiligen Grundeigentümers.

It is not a good idea to just delete all the hyphens because sometimes they are meaningful, such as in Verkehrs- und Tarifverbund Stuttgart.

Is there any way to avoid doing it manually? It happens in almost every sentence.

Upvotes: 0

Views: 43

Answers (1)

Mahrkeenerh

Reputation: 1121

If a word was split only because it was too long to fit at the end of a line, the hyphen is immediately followed by a line break, so you should be able to just remove every "-\n" (replace it with "").
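A minimal sketch of that replacement, using the sentences from the question as sample input (the variable names `text` and `joined` are just for illustration):

```python
# Sample input: one word split at a line end, plus a hyphen that must stay.
text = (
    "in der Regel wiederkeh-\n"
    "rende Leistungen des jeweiligen Grundeigentümers.\n"
    "Verkehrs- und Tarifverbund Stuttgart"
)

# Remove only hyphens that are directly followed by a newline.
joined = text.replace("-\n", "")
print(joined)
```

The hyphen in "Verkehrs- und" survives because it is followed by a space, not a line break. If such a hyphen happens to fall at the very end of a line, this simple replacement would wrongly join the words, so it is worth spot-checking the result.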

If your document uses some other character or sequence to mark the end of a line (for example "\r\n"), you need to use that in place of "\n".
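If the line endings are not known in advance, a regular expression can cover the common variants in one pass. The sketch below is only a heuristic under stated assumptions: the helper name `dehyphenate` is made up, it assumes a real word break always continues with a lowercase letter on the next line, and it assumes that a following "und" or "oder" means the hyphen is intentional (as in "Verkehrs- und Tarifverbund").

```python
import re

# Hyphen at a line end, optionally followed by spaces/tabs, then any of the
# common line endings. Only matches when the next line starts with a
# lowercase letter that is not the start of the word "und" or "oder".
LINE_END_HYPHEN = re.compile(
    r"(?<=\w)-[ \t]*(?:\r\n|\r|\n)(?!(?:und|oder)\b)(?=[a-zäöüß])"
)

def dehyphenate(text: str) -> str:
    """Join words that were split across lines (heuristic, see assumptions)."""
    return LINE_END_HYPHEN.sub("", text)

print(dehyphenate("in der Regel wiederkeh-\r\nrende Leistungen"))
```

Hyphens followed by an uppercase letter on the next line are left alone, since in German that usually indicates a genuine compound such as "E-Mail-" / "Adresse". The rule will still miss or misjudge some cases, so a dictionary check would be needed for anything stricter.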

Upvotes: 1
