Reputation: 1
I'm currently working on a neural network that evaluates students' answers to exam questions. Therefore, preprocessing the corpora for a Word2Vec network is needed. Hyphenation in german texts is quite common. There are mainly two different types of hyphenation:
1) End of line:
The text reaches the end of the line so the last word is sepa-
rated.
2) Short form of enumeration:
in case of two "elements":
Geistes- und Sozialwissenschaften
more "elements":
Wirtschafts-, Geistes- und Sozialwissenschaften
The de-hyphenated form of these enumerations should be:
Geisteswissenschaften und Sozialwissenschaften
Wirtschaftswissenschaften, Geisteswissenschaften und Sozialwissenschaften
I need to remove all hyphenations and put the words back together. I already found several solutions for the first problem.
But I have absoluteley no clue how to get the second part (in the example above "wissenschaften") of the words in the enumeration problem. I don't even know if it is possible at all.
I hope that I have pointet out my problem properly.
So has anyone an idea how to solve this problem?
Thank you very much in advance!
Upvotes: 0
Views: 110
Reputation: 54193
It's surely possible, as the pattern seems fairly regular. (Something vaguely analogous is sometimes seen in English. For example: The new requirements applied to under-, over-, and average-performing employees.
)
The rule seems to be roughly, "when you see word-fragments with a trailing hyphen, and then an und
, look for known words that begin with the word-fragments, and end the same as the terminal-word-after-und
– and replace the word-fragments with the longer words".
Not being a German speaker and without language-specific knowledge, it wouldn't be possible to know exactly where breaks are appropriate. That is, in your Geistes- und Sozialwissenschaften
example, without language-specific knowledge, it's unclear whether the first fragment should become Geisteszialwissenschaften
or Geisteswissenschaften
or Geistesenschaften
or Geiestesaften
or any other shared-suffix with Sozialwissenschaften
. But if you've got a dictionary of word-fragments, or word-frequency info from other text that uses the same full-length word(s) without this particular enumeration-hyphenation, that could help choose.
(If there's more than one plausible suffix based on known words, this might even be a possible application of word2vec: the best suffix to choose might well be the one that creates a known-word that is closest to the terminal-word in word-vector-space.)
Since this seems a very German-specific issue, I'd try asking in forums specific to German natural-language-processing, or to libraries with specific German support. (Maybe, NLTK or Spacy?)
But also, knowing word2vec, this sort of patch-up may not actually be that important to your end-goals. Training without this logical-reassembly of the intended full words may still let the fragments achieve useful vectors, and the corresponding full words may achieve useful vectors from other usages. The fragments may wind up close enough to the full compound words that they're "good enough" for whatever your next regression/classifier step does. So if this seems a blocker, don't be afraid to just try ignoring it as a non-problem. (Then if you later find an adequate de-hyphenation approach, you can test whether it really helped or not.)
Upvotes: 1