Reputation: 4214
I'm trying to tokenize text files that contain useful text but also many numbers that I don't want. However, using something like [^a-zA-Z0-9], I retain all digits (0-9).
I would like to retain digits ONLY if they are attached to letters or hyphenated, like "24hr" or "7-days".
So, the input "There are 3, 24hr positions available 7-days a week. Call 555-1212"
should return the following list of tokens: There are 24hr positions available 7-days a week Call
Thanks for any help!
Upvotes: 1
Views: 1450
Reputation: 219938
\d+-?[A-Za-z]+|[A-Za-z]+-?\d+|[A-Za-z]+
See it here in action: http://regexr.com?318em
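For anyone who wants to try the pattern outside regexr, here is a minimal sketch in Python. The question does not name a language, so Python and the helper name tokenize are assumptions made for illustration only; the pattern itself is the one given above.

```python
import re

# Pattern from the answer above, unchanged:
# digits then letters, letters then digits, or letters only.
TOKEN_RE = re.compile(r"\d+-?[A-Za-z]+|[A-Za-z]+-?\d+|[A-Za-z]+")


def tokenize(text):
    """Return every non-overlapping match of TOKEN_RE, left to right."""
    return TOKEN_RE.findall(text)


text = "There are 3, 24hr positions available 7-days a week. Call 555-1212"
tokens = tokenize(text)
print(tokens)
```

Bare numbers such as "3" and "555-1212" contain no letters, so none of the three alternatives can match them and they are dropped.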
Upvotes: 3
Reputation: 4214
After lots of trial and error, this did it (note leading space):
\d[^-a-z]+ | -\d+|[^a-zA-Z0-9-]|[0-9]+-[0-9]+|\W-+|[0-9]+-\W
See it at http://regexr.com?318hp
I hope this helps anyone else who needs it. I'm using it in RapidMiner to remove unwanted tokens in text processing.
Upvotes: 0
Reputation: 61512
The square brackets [ and ] represent something called a character class, which basically means match any single character in this class. [A-Za-z0-9]+ will match any run of letters and digits, in any order.
If you want to specify order, you need to remove the digits from the character class and add a separate character class for them after it.
For example:
[0-9]+-?[a-zA-Z]+|[a-zA-Z]+-?[0-9]+|[a-zA-Z]+
[a-zA-Z]+ - matches 1 or more letters
-? - optionally matches a dash
[0-9]+ - matches 1 or more digits
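To make the difference concrete, the following sketch compares a single mixed character class with the ordered pattern above. Python is assumed here only for illustration, and the variable names mixed, ORDERED, and ordered are hypothetical; re.VERBOSE is used so the pieces described above can be annotated inline.

```python
import re

text = "There are 3, 24hr positions available 7-days a week. Call 555-1212"

# One class mixing letters and digits imposes no order,
# so runs made only of digits match as well.
mixed = re.findall(r"[A-Za-z0-9]+", text)

# Same pattern as above, spread out with comments via re.VERBOSE
# (whitespace and # comments outside classes are ignored).
ORDERED = re.compile(r"""
    [0-9]+ -? [a-zA-Z]+    # digits, optional dash, letters  (24hr, 7-days)
  | [a-zA-Z]+ -? [0-9]+    # letters, optional dash, digits
  | [a-zA-Z]+              # letters only
""", re.VERBOSE)
ordered = ORDERED.findall(text)

print(mixed)
print(ordered)
```

The mixed class keeps "3", "555", and "1212" and splits "7-days" at the dash, while the ordered pattern keeps "7-days" whole and skips the bare numbers.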
Upvotes: 0