Unnikrishnan

Reputation: 563

Regex for any word in text file

I need to capture every word in a text file separately. A word can be an ordinary word, a number, a number containing hyphens, and so on.

My criterion for a word is that, whatever it contains, it is either separated from the next word by a space or it ends with a dot.

I tried @"(\w+)+" in C#, but it fails to capture every word as defined above, and it misses things like +-.,!@#$%^&*();\/|<>"'.

The purpose is to create a unique list of words.

Upvotes: 0

Views: 461

Answers (4)

ergonaut

Reputation: 7057

Try this pattern:

([^\s\.]+)\.?

means:

(    - beginning of capture
 [   - one of..
  ^  - none of the following characters
  \s - a whitespace character (tab, space, etc.)
  \. - a literal dot
 ]
 +   - one or more of the previous block (in []) in a greedy way
)    - close of capture block
\.   - a literal dot
?    - zero or one

This matches a run of one or more characters that are neither whitespace nor a dot, optionally followed by a dot. The capture group never includes that trailing dot.
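As a minimal sketch, this is how the pattern above might be used in C# to build the unique list. The names WordCapture and UniqueWords are hypothetical, and the case-sensitive comparison is an assumption about what "unique" should mean here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordCapture
{
    // Pattern from the answer: group 1 holds the word without a trailing dot.
    private static readonly Regex WordPattern = new Regex(@"([^\s\.]+)\.?");

    // Hypothetical helper: returns distinct words in order of first appearance.
    public static List<string> UniqueWords(string text)
    {
        // Assumption: words are compared case-sensitively.
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var words = new List<string>();

        foreach (Match m in WordPattern.Matches(text))
        {
            string word = m.Groups[1].Value;
            if (seen.Add(word)) // Add returns false for a word already seen
            {
                words.Add(word);
            }
        }
        return words;
    }
}
```

Calling WordCapture.UniqueWords("call 555-1234 now. call again.") should yield call, 555-1234, now, again. Note that a dot inside a token also acts as a separator, so "3.14" comes out as "3" and "14".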

Upvotes: 2

ashes999

Reputation: 10163

Regex has a "word boundary" anchor (\b). It matches the zero-width position between a word character and a non-word character, such as a space or punctuation. Since your criteria include numbers (is it ASCII-only?), this is probably the best solution for your specific case.

You can try this regex: \b([^\b]+)\b

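The intent is to match a word boundary, then one or more non-boundary characters, up to the next word boundary. Be aware that inside a character class, \b means a backspace character in .NET, so [^\b]+ actually matches any run of characters other than backspace.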

Upvotes: 0

Pramuka

Reputation: 1064

Use String.Split() and set your delimiters to space, dot, and/or newline. If you want a regex as the delimiter instead, use Regex.Split().

https://msdn.microsoft.com/en-us/library/b873y76a(v=vs.110).aspx
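A rough sketch of the String.Split() approach, assuming space, dot, tab, and line breaks as delimiters (tab is my addition). The names SplitWords and UniqueWords are made up for illustration.

```csharp
using System;
using System.Linq;

public static class SplitWords
{
    // Assumed delimiter set: space, dot, tab, and both line-break characters.
    private static readonly char[] Delimiters = { ' ', '.', '\t', '\r', '\n' };

    // Hypothetical helper: splits the text and removes duplicate words.
    public static string[] UniqueWords(string text)
    {
        return text
            // RemoveEmptyEntries drops the empty strings produced by
            // adjacent delimiters such as ". " or a double space.
            .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries)
            .Distinct()
            .ToArray();
    }
}
```

The regex variant would be Regex.Split(text, @"[\s.]+"). Unlike the call above, it can return an empty string at the start or end when the text begins or ends with a delimiter, so those would need filtering.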

Upvotes: 0

JacquesB

Reputation: 42669

You want [^.\s]+, which matches any sequence of characters that are not whitespace or a dot.
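A small sketch of how that might look in C#, collecting the matches into a sorted set. WordList and FromText are hypothetical names, and ignoring case when deduplicating is an assumption; swap in StringComparer.Ordinal if "The" and "the" should count as different words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordList
{
    // Hypothetical helper: every run of non-dot, non-whitespace characters is a word.
    public static SortedSet<string> FromText(string text)
    {
        IEnumerable<string> words = Regex.Matches(text, @"[^.\s]+")
            .Cast<Match>()          // MatchCollection is non-generic on older frameworks
            .Select(m => m.Value);

        // Assumption: words that differ only by case are treated as duplicates.
        return new SortedSet<string>(words, StringComparer.OrdinalIgnoreCase);
    }
}
```

For a file, the input could come from File.ReadAllText("words.txt") in System.IO, where the file name is only a placeholder. Symbol-only tokens such as +- are kept as words, which fits the criteria in the question.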

Upvotes: 2
