Reputation: 9
I'm trying to capture every word in a .txt document.
Words are defined as any unbroken string of letters and hyphens, which may contain an apostrophe (both the ASCII apostrophe and the "RIGHT SINGLE QUOTATION MARK" character are captured, because the input may use either) or, as a regular expression:
[a-zA-Z\-]+['a-zA-Z\-\’\']*
Now this seems to work in several online Regex testing web-app thingos, but it just does not seem to want to work in my C# code and I don't understand why:
MatchCollection matches = Regex.Matches(input_String.ToLowerInvariant(),
@"[a-zA-Z\-]+['a-zA-Z\-\’\']*");
string[] sorting_String = matches.Cast<Match>().Select(match => match.Value).ToArray();
When a word like "I'm" is contained in the text, it's returning "i" and "m" as separate words, rather than the intended single entry "i'm".
I haven't found anything from googling this time, and since it DOES work as intended in the online testers... and I can't figure out if it's an escape issue... I'm stumped.
Could someone explain to me why it isn't returning what I expect in C#? Or at least, with the System.Text.RegularExpressions library? I assume it's just me being silly/ignorant.
EDIT 1: Here is a screenshot of the locals showing the issue: [Image of Locals]. It should be "book's". Huh, I just inspected my input string variable, and it looks like I'm getting stuff like this: [Image of encoding issue? maybe?]
Ehhhh, the input is a .txt file, and its formatting is retained in the file... so something is happening in my code that's not playing nice... uh, at least, that's where I'm guessing the issue is now... I'm not an expert at this XD. Um, sorry to be a bother, but could I be pointed in the direction of resources that could assist me with this?
Upvotes: 0
Views: 442
Reputation: 86
You can try this: [\w\'\-]+[\w\'\-]*
and see if it works.
I think you should escape the first ' in the second bracket.
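To narrow down whether the pattern or the input is at fault, here is a rough sketch that runs the question's pattern on a string built in code, then on a deliberately mis-decoded copy. The `WordDemo` class and `Words` helper are made-up names for illustration, and the mis-decoding step (UTF-8 bytes read as ISO-8859-1) is only an assumption about what the EDIT 1 screenshot might be showing; the actual cause in the asker's code could be different.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class WordDemo
{
    // The question's pattern with the duplicate apostrophe removed.
    // \u2019 is RIGHT SINGLE QUOTATION MARK, written as a regex escape so the
    // source file's own encoding cannot affect it.
    private static readonly Regex WordPattern =
        new Regex(@"[a-zA-Z\-]+[a-zA-Z\-'\u2019]*");

    // Hypothetical helper: lower-cases the text and returns every match.
    public static string[] Words(string text) =>
        WordPattern.Matches(text.ToLowerInvariant())
                   .Cast<Match>()
                   .Select(m => m.Value)
                   .ToArray();

    public static void Main()
    {
        // Correctly decoded text: both apostrophe styles stay inside the word.
        string sample = "I'm reading the book\u2019s well-known intro";
        Console.WriteLine(string.Join("|", Words(sample)));

        // Assumed failure mode: the file's UTF-8 bytes decoded as ISO-8859-1.
        // The quotation mark becomes three unrelated characters, so the word splits.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("book\u2019s");
        string garbled = Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
        Console.WriteLine(string.Join("|", Words(garbled)));
    }
}
```

If the first call keeps "i'm" and "book’s" whole while the second splits into "book" and "s", the regex itself is fine and the string is being damaged when the file is read. In that case it may be worth passing the encoding explicitly when loading the file, for example `File.ReadAllText(path, Encoding.UTF8)`, assuming the .txt file really is saved as UTF-8.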
Upvotes: 2