Reputation: 462
I converted a .pdf file into .txt using Python. It is fairly easy to "clean" the text by removing special characters or other characters that I don't want. However, I have an interesting problem that I haven't managed to solve other than manually.
The text is in German and some words are broken into syllables (they were probably like that in the original .pdf). So I have stuff like
Das ist die Belastung eines Grundstücks mit der Haftung für bestimmte, in der Regel wiederkeh-
rende Leistungen des jeweiligen Grundeigentümers.
It is not a good idea to just delete the hyphens because sometimes they make sense, such as in Verkehrs- und Tarifverbund Stuttgart.
Is there any way to avoid doing it manually? It happens in almost every sentence.
Upvotes: 0
Views: 43
Reputation: 1121
If the word was split due to it being too long and at the end of the line, you should be able to just remove "-\n" (replace it with "").
If your document uses some other special character to indicate the end of line, you need to replace \n with that instead.
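As a rough sketch of that replacement in Python: the helper name join_hyphenated and the sample string are made up for illustration, and the pattern assumes the line ending is either \n or \r\n. A hyphen followed by a space, as in "Verkehrs- und", is not matched and so stays intact.

```python
import re

def join_hyphenated(text: str) -> str:
    # Drop a hyphen that sits directly before a line break (\n or \r\n),
    # which joins the two halves of a word split across lines.
    return re.sub(r"-\r?\n", "", text)

# Hypothetical sample built from the text in the question.
sample = (
    "in der Regel wiederkeh-\n"
    "rende Leistungen des jeweiligen Grundeigentümers.\n"
    "Verkehrs- und Tarifverbund Stuttgart"
)
print(join_hyphenated(sample))
```

One caveat: if a meaningful hyphen happens to fall at the very end of a line (for example "Verkehrs-" with "und" starting the next line), this would join it too, so those cases may still need a manual check.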
Upvotes: 1