zbinsd

Reputation: 4214

In Java regex - how to retain numbers ONLY when attached to string

I'm trying to tokenize text files that contain useful text but also many numbers that I don't want. However, using something like [^a-zA-Z0-9], I retain all digits (0-9).

I would like to retain digits ONLY if they are attached to letters, either directly or with a hyphen, as in "24hr" or "7-days".

So, input: "There are 3, 24hr positions available 7-days a week. Call 555-1212"

Should return a list of the following tokens: There are 24hr positions available 7-days a week Call

Thanks for any help!

Upvotes: 1

Views: 1450

Answers (3)

Joseph Silber

Reputation: 219938

\d+-?[A-Za-z]+|[A-Za-z]+-?\d+|[A-Za-z]+

See it here in action: http://regexr.com?318em
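
As a rough sketch of how this pattern could be applied in Java, the snippet below collects every match with Matcher.find() rather than splitting on the unwanted characters. The class and method names (TokenExtractor, tokenize) are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenExtractor {
    // The pattern from this answer; backslashes are doubled for the Java string literal.
    private static final Pattern TOKEN =
            Pattern.compile("\\d+-?[A-Za-z]+|[A-Za-z]+-?\\d+|[A-Za-z]+");

    // Hypothetical helper: returns every substring the pattern matches, in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "There are 3, 24hr positions available 7-days a week. Call 555-1212";
        System.out.println(tokenize(input));
        // prints [There, are, 24hr, positions, available, 7-days, a, week, Call]
    }
}
```

The bare "3" and "555-1212" are dropped because every alternative in the pattern requires at least one letter.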

Upvotes: 3

zbinsd

Reputation: 4214

After lots of trial and error, this did it (note leading space):

     \d[^-a-z]+ | -\d+|[^a-zA-Z0-9-]|[0-9]+-[0-9]+|\W-+|[0-9]+-\W

Demo: http://regexr.com?318hp

I hope this helps anyone else who needs it. I'm using it in RapidMiner to remove unwanted tokens in text processing.

Upvotes: 0

Hunter McMillen

Reputation: 61512

Square brackets [ ] denote a character class, which matches any single character listed in the class. With a quantifier, [A-Za-z0-9]+ matches any run of letters and digits in any order, so a token made up entirely of digits matches too.

If you want to require a particular order, remove the digits from the character class and put them in a separate character class before or after it.

For example:

[0-9]+-?[a-zA-Z]+|[a-zA-Z]+-?[0-9]+|[a-zA-Z]+

[a-zA-Z]+ - matches 1 or more letters
-?        - optionally matches a dash
[0-9]+    - matches 1 or more digits
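
To make the difference concrete, here is a small sketch that compares a single character class with the ordered pattern above. The class name, the helper method, and the sample string are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClassVersusSequence {
    // Hypothetical helper: lists every match of regex in text, in order.
    static List<String> matches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "24hr 7-days 555";

        // One class: letters and digits in any order, so the bare number
        // survives and the hyphenated token is split in two.
        System.out.println(matches("[A-Za-z0-9]+", text));
        // prints [24hr, 7, days, 555]

        // Separate classes in sequence: every alternative needs at least one letter.
        System.out.println(matches("[0-9]+-?[a-zA-Z]+|[a-zA-Z]+-?[0-9]+|[a-zA-Z]+", text));
        // prints [24hr, 7-days]
    }
}
```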

Upvotes: 0
