Python 3: remove multiple hyphenations from a German string

Question

I'm currently working on a neural network that evaluates students' answers to exam questions, so I need to preprocess the corpora for a Word2Vec model. Hyphenation is quite common in German texts. There are two main types of hyphenation:

1) End of line:

The text reaches the end of the line so the last word is sepa- rated.

2) Short form of enumeration:

in case of two "elements":

Geistes- und Sozialwissenschaften

more "elements":

Wirtschafts-, Geistes- und Sozialwissenschaften

The de-hyphenated form of these enumerations should be:

Geisteswissenschaften und Sozialwissenschaften

Wirtschaftswissenschaften, Geisteswissenschaften und Sozialwissenschaften

I need to remove all hyphenations and put the words back together. I already found several solutions for the first problem.
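
For context, the first case is typically handled with a regular expression along these lines. This is only a minimal sketch, not one of the specific solutions mentioned above; the function name and the conjunction list are assumptions for illustration, and the list is not exhaustive. The conjunction check matters because a naive pattern would also turn "Geistes- und" into "Geistesund".

```python
import re

# Words that signal an enumeration ("Geistes- und ..."), not a line break.
# Assumed list for illustration; extend as needed.
CONJUNCTIONS = {"und", "oder", "bzw", "sowie"}

# A word fragment, a hyphen, whitespace, then a lowercase continuation.
_LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\s+([a-zäöüß]\w*)")


def join_line_break_hyphens(text: str) -> str:
    """Join words that were split by end-of-line hyphenation."""

    def repl(match: re.Match) -> str:
        head, tail = match.group(1), match.group(2)
        if tail in CONJUNCTIONS:
            # Enumeration hyphen: leave it for a separate step.
            return match.group(0)
        return head + tail

    return _LINE_BREAK_HYPHEN.sub(repl, text)


print(join_line_break_hyphens("the last word is sepa- rated."))
print(join_line_break_hyphens("Geistes- und Sozialwissenschaften"))
```

The second call leaves the enumeration untouched, so the two cases can be processed independently.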

But I have absolutely no clue how to recover the second part of the words in the enumeration case ("wissenschaften" in the example above). I don't even know if it is possible at all.

I hope that I have described my problem clearly.

Does anyone have an idea how to solve this problem?

Thank you very much in advance!

Answers (1)
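
One possible starting point for the enumeration case, sketched under a strong assumption: a vocabulary of known full words is available (for example, collected from the corpus itself or from a German word list). The idea is to take the complete word after "und"/"oder", try its suffixes from longest to shortest, and accept the first suffix that turns the hyphenated prefix into a known word. The names `find_shared_tail`, `expand_enumerations` and `vocabulary` are hypothetical, and this is a heuristic rather than a complete solution.

```python
import re

# One or more "Prefix-" parts separated by commas, a conjunction, a full word.
_ENUMERATION = re.compile(
    r"\b(?P<prefixes>(?:\w+-\s*,\s*)*\w+-)\s+(?P<conj>und|oder)\s+(?P<last>\w+)"
)


def find_shared_tail(prefix, last_word, vocabulary):
    """Return the longest proper suffix of last_word that completes prefix."""
    for start in range(1, len(last_word)):
        tail = last_word[start:]
        if (prefix + tail).lower() in vocabulary:
            return tail
    return None


def expand_enumerations(text, vocabulary):
    """Expand 'A-, B- und Cxyz' to 'Axyz, Bxyz und Cxyz' where possible.

    vocabulary is assumed to be a set of lowercased known words.
    """

    def repl(match):
        last = match.group("last")
        expanded = []
        for prefix in re.findall(r"(\w+)-", match.group("prefixes")):
            tail = find_shared_tail(prefix, last, vocabulary)
            if tail is None:
                # Unknown compound: leave the whole enumeration unchanged.
                return match.group(0)
            expanded.append(prefix + tail)
        return ", ".join(expanded) + f" {match.group('conj')} {last}"

    return _ENUMERATION.sub(repl, text)


# Toy vocabulary, assumed for the example only.
vocabulary = {
    "wirtschaftswissenschaften",
    "geisteswissenschaften",
    "sozialwissenschaften",
}
print(expand_enumerations("Wirtschafts-, Geistes- und Sozialwissenschaften", vocabulary))
```

Because the hyphenated prefixes already carry their linking element (the "s" in "Geistes-"), plain concatenation is enough here. The quality of the result depends entirely on the vocabulary: compounds that are missing from it are left as they are.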
