Reputation: 2782
I've got a largish dataset of German text that was generated with some encoding problems, and I can't recreate the dataset from scratch. I've discovered that wherever a German special character should be, the string "??" appears in its place (I'm guessing this came from treating UTF-8 as ASCII or something along those lines).
The dataset is in the form of a series of CSV files containing around 180,000 lines. My solution is to identify all unique words which contain "??" and basically do a string replace. Fortunately, there aren't so many unique words to replace (18 words from a sample of approx. 5% of the dataset).
I've managed to get a regular expression which matches words containing exactly one instance of "??". The problem is that it splits words which contain more than one instance of "??" into two partial matches.
At this stage I'm reaching the limits of my regular expression knowledge. I guess this needs some kind of look-ahead, but I'm not sure how to go about it.
Here's my regular expression: @"(?<TM>\w*\?\?\w*)"
Here's an example string (note that the second word will get split into two matches): "hellgr??n Hei??folienflachpr??gung Folienpr??gung,"
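For illustration, the split can be reproduced with a minimal Python sketch. Python is an assumption here (the pattern above is written in .NET syntax), so the named group is spelled `(?P<TM>...)` instead of `(?<TM>...)`; the variable names are hypothetical.

```python
import re

# Same pattern as in the question, adapted to Python's named-group syntax.
original = re.compile(r"(?P<TM>\w*\?\?\w*)")

sample = "hellgr??n Hei??folienflachpr??gung Folienpr??gung,"

# Collect the text captured by the TM group for every match.
matches = [m.group("TM") for m in original.finditer(sample)]

# The second word comes back as two pieces: the trailing \w* stops at the
# second "??", and the next match then starts at that "??".
print(matches)
```

The second `\w*` cannot cross a `?`, so the first match ends just before the second "??" and the rest of the word is picked up as a separate match.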
Upvotes: 0
Views: 519
Reputation: 112352
(?<TM>\w*(\?\?\w*)+)
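Wrap the part containing the ?? in a group and repeat it at least once with +, so that a word with several ?? sequences is consumed as a single match. No look-ahead is needed.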
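A rough Python sketch of how this pattern could drive the find-and-replace step described in the question. Several things here are assumptions: Python itself (hence `(?P<TM>...)` rather than the .NET `(?<TM>...)`), the helper names `damaged_words` and `repair`, and the corrected spellings in `corrections`, which are only guesses at the intended German words.

```python
import re

# The pattern from this answer, adapted to Python's named-group syntax.
PATTERN = re.compile(r"(?P<TM>\w*(\?\?\w*)+)")

def damaged_words(text):
    """Return the set of unique words containing one or more '??'."""
    return {m.group("TM") for m in PATTERN.finditer(text)}

def repair(text, corrections):
    """Replace each damaged word found in `corrections`; leave others as-is."""
    return PATTERN.sub(
        lambda m: corrections.get(m.group("TM"), m.group("TM")), text
    )

sample = "hellgr??n Hei??folienflachpr??gung Folienpr??gung,"

# Hypothetical hand-built lookup table; the right-hand sides are guesses.
corrections = {
    "hellgr??n": "hellgrün",
    "Hei??folienflachpr??gung": "Heißfolienflachprägung",
    "Folienpr??gung": "Folienprägung",
}

found = damaged_words(sample)
repaired = repair(sample, corrections)
print(sorted(found))
print(repaired)
```

Running `damaged_words` over every CSV line first gives the list of unique words to correct by hand; words missing from the table are left unchanged by `repair`, so they can be spotted in a second pass.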
Upvotes: 2