Reputation: 563
I need to capture every word in a text file separately. The words can be ordinary words, numbers, numbers containing hyphens, etc.
My criterion for a word is that, whatever it contains, it is either separated from the next word by a space or it ends with a dot.
I tried @"(\w+)+" in C#, but it fails to capture every word as defined above, as well as things like +-.,!@#$%^&*();\/|<>"'.
The purpose is to create a unique list of words.
Upvotes: 0
Views: 461
Reputation: 7057
Try this:
([^\s\.]+)\.?
means:
( - beginning of capture
[ - start of a character class
^ - negation: match any character except the following
\s - a whitespace character (space, tab, newline, etc.)
\. - a literal dot
] - end of the character class
+ - one or more of the previous block (in []) in a greedy way
) - close of capture block
\. - a literal dot
? - zero or one
This matches one or more characters that are neither whitespace nor a dot, optionally followed by a dot. The trailing dot is consumed by the match but is never part of the capture group.
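As a rough sketch of how this pattern could feed the unique word list the question asks for, the snippet below runs it with Regex.Matches and collects group 1 into a HashSet. The sample text and variable names are made up for illustration, and the comparison is case-sensitive, so "The" and "the" count as different words.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Sample input is an assumption, not from the original question.
        string text = "The cat sat. The code is 12-34 and the cat left.";

        // A HashSet stores each distinct word only once.
        var unique = new HashSet<string>();

        foreach (Match m in Regex.Matches(text, @"([^\s\.]+)\.?"))
        {
            // Groups[1] is the word itself; the optional trailing dot
            // sits outside the capture group and is not included.
            unique.Add(m.Groups[1].Value);
        }

        Console.WriteLine(string.Join(", ", unique));
    }
}
```

If case should be ignored, passing StringComparer.OrdinalIgnoreCase to the HashSet constructor is one option.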
Upvotes: 2
Reputation: 10163
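Regex has a "word boundary" anchor (\b). It is zero-width: it matches the position between a word character (\w) and a non-word character such as a space or punctuation, without consuming either. Since your criteria include numbers (is it ASCII-only?), this may work for your specific case.
You can try this regex: \b(\w+)\b
This matches a word boundary, then one or more word characters, up to the next word boundary. Note that [^\b] does not work for this: inside a character class, \b means the backspace character, not a word boundary. Also keep in mind that a hyphen is not a word character, so 12-34 is captured as 12 and 34.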
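To check how \b-based patterns behave in .NET, here is a small sketch comparing \b([^\b]+)\b with \b\w+\b on a made-up sample string. It relies on the documented .NET behavior that \b inside a character class means backspace (U+0008).

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Sample input is an assumption chosen for illustration.
        string text = "foo 12-34 bar.";

        // Inside [...], \b is the backspace character, so [^\b]+ matches
        // any run of non-backspace characters, spaces included.
        foreach (Match m in Regex.Matches(text, @"\b([^\b]+)\b"))
            Console.WriteLine("[" + m.Groups[1].Value + "]");
        // prints a single line: [foo 12-34 bar]

        // \w+ between two boundaries stops at spaces, hyphens and dots.
        foreach (Match m in Regex.Matches(text, @"\b\w+\b"))
            Console.WriteLine("[" + m.Value + "]");
        // prints four lines: [foo], [12], [34], [bar]
    }
}
```

The second pattern splits the hyphenated number, which may or may not be acceptable depending on how strictly the "numbers containing hyphens" requirement applies.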
Upvotes: 0
Reputation: 1064
Use String.Split() and set your delimiters to space, dot, and/or newline. If you want to split on a regex pattern instead, use Regex.Split().
https://msdn.microsoft.com/en-us/library/b873y76a(v=vs.110).aspx
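A minimal sketch of the splitting approach, assuming space, dot, tab, and line-break characters as delimiters (the exact delimiter set and the sample text are my assumptions). It shows both String.Split with a char array and Regex.Split with a pattern, then removes duplicates with LINQ's Distinct.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Sample input is made up for illustration.
        string text = "one two. three\nfour 12-34 two.";

        // Variant 1: fixed delimiter characters.
        // RemoveEmptyEntries drops the empty strings produced by
        // adjacent delimiters such as ". " or a trailing dot.
        char[] delimiters = { ' ', '.', '\t', '\r', '\n' };
        string[] words = text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

        // Variant 2: a regex as the delimiter. Regex.Split can return
        // empty strings at the edges, so they are filtered out by hand.
        string[] words2 = Regex.Split(text, @"[\s.]+")
                               .Where(w => w.Length > 0)
                               .ToArray();

        // Distinct() removes repeated words from either result.
        Console.WriteLine(string.Join("|", words.Distinct()));
        Console.WriteLine(string.Join("|", words2.Distinct()));
    }
}
```

Both variants keep 12-34 as one word, because the hyphen is not among the delimiters.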
Upvotes: 0
Reputation: 42669
You want [^.\s]+ which matches any sequence of characters that are neither whitespace nor a dot.
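Since this pattern has no capture group, the whole match is the word. A short sketch with made-up sample text, showing that symbols other than the dot stay attached to the word:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Sample input is an assumption chosen to include symbols.
        string text = "C# is fun. Price: $5-10!";

        // Cast<Match>() lets LINQ operators work on the MatchCollection;
        // m.Value is the full match, which is the word itself here.
        var words = Regex.Matches(text, @"[^.\s]+")
                         .Cast<Match>()
                         .Select(m => m.Value)
                         .ToList();

        foreach (string w in words)
            Console.WriteLine(w);
        // prints, one per line: C#, is, fun, Price:, $5-10!
    }
}
```

Appending .Distinct() before ToList() would reduce the result to unique words.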
Upvotes: 2