Reputation: 433
I am trying to create a regex pattern that will split a string into an array of words based on many different patterns and conventions. The rules are as follows:
If it is working properly, the following should be true:
"theQuick--brown_fox JumpsOver___the.lazy DOG".split_words ==
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
So far, I have been able to get almost there. The only issue is that it splits on every capital, so "DOG".split_words is ["d", "o", "g"] and not ["dog"].
I also use a combination of regex and maps/filters on the split array to get to the solution. Bonus points if you can tell me how to get rid of that and use only regex.
Here's what I have so far:
class String
  def split_words
    split(/[_,\-, ,.]|(?=[A-Z]+)/).
      map(&:downcase).
      reject(&:empty?)
  end
end
Which when called on the string from the test above returns:
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "d", "o", "g"]
How can I update this method to meet all of the above specs?
Upvotes: 2
Views: 699
Reputation: 26778
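You can slightly change the regex so that it doesn't split before every capital, but only before a run of capitals that is followed by lowercase letters. This just involves putting a [a-z]+ after the [A-Z]+ inside the lookahead. The result keeps its original case, so add map(&:downcase) as in the question if you want lowercase words.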
string = "theQuick--brown_fox JumpsOver___the.lazy DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/
string.split(regex).reject(&:empty?)
# => ["the", "Quick", "brown", "fox", "Jumps", "Over", "the", "lazy", "DOG"]
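Combined with the downcasing step from the question, the idea above can be sketched as follows. The name split_words_sketch is made up for illustration, and the character class is written as [_\-, .], which lists the same five delimiters as the original without the repeated commas.

```ruby
# Hypothetical helper (not from the original answer): a plain method
# instead of a String patch, so it can be tried in isolation.
def split_words_sketch(string)
  string
    .split(/[_\-, .]|(?=[A-Z]+[a-z]+)/) # a delimiter, or just before a capitalized word
    .reject(&:empty?)                   # drop empties left by runs like "--" or "___"
    .map(&:downcase)
end

split_words_sketch("theQuick--brown_fox JumpsOver___the.lazy DOG")
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

The reject step is still needed here, because consecutive delimiters produce empty strings in the split result.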
Upvotes: 5
Reputation: 110725
r = /
    [- _.]+      # match one or more combinations of dashes, spaces,
                 # underscores and periods
    |            # or
    (?<=\p{Ll})  # match a lower case letter in a positive lookbehind
    (?=\p{Lu})   # match an upper case letter in a positive lookahead
    /x           # free-spacing regex definition mode
str = "theQuick--brown_dog, JumpsOver___the.--lazy FOX for $5"
str.split(r).map(&:downcase)
#=> ["the", "quick", "brown", "dog,", "jumps", "over", "the", "lazy",
#    "fox", "for", "$5"]
If the string is to be broken on spaces and all punctuation characters, replace [- _.]+ with [ [:punct:]]+. Search for "[[:punct:]]" in the Regexp documentation for reference.
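As a rough sketch of that substitution, the regex can be written on one line as below. The constant name WORD_BREAK and the method name split_on_punct are assumptions for illustration only, and the sample string is a shortened version of the one above.

```ruby
# Hypothetical names; the pattern is the answer's regex with
# [- _.]+ swapped for [ [:punct:]]+.
WORD_BREAK = /[ [:punct:]]+|(?<=\p{Ll})(?=\p{Lu})/

def split_on_punct(str)
  # Split on runs of spaces/punctuation, or between a lower case
  # letter and the upper case letter that follows it.
  str.split(WORD_BREAK).map(&:downcase)
end

split_on_punct("theQuick--brown_dog, JumpsOver___the.--lazy FOX")
# => ["the", "quick", "brown", "dog", "jumps", "over", "the", "lazy", "fox"]
```

With this variant the comma after "dog" is treated as a delimiter, so it no longer sticks to the word.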
Upvotes: 2
Reputation: 627082
You may use a matching approach instead of splitting: extract chunks of 2 or more uppercase letters, or any letter followed by zero or more lowercase letters:
s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
See the Ruby demo and the Rubular demo.
The regex matches:
\p{Lu}{2,} - 2 or more uppercase letters
| - or
\p{L} - any letter
\p{Ll}* - 0 or more lowercase letters.
With map(&:downcase), the items you get with .scan() are turned to lower case.
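Applied to the string from the question, the scan approach can be sketched like this. The method name scan_words is hypothetical and only wraps the one-liner above.

```ruby
# Hypothetical wrapper around the scan-based one-liner.
def scan_words(s)
  # Collect either a run of 2+ uppercase letters, or a single letter
  # followed by any number of lowercase letters; everything else is skipped.
  s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
end

scan_words("theQuick--brown_fox JumpsOver___the.lazy DOG")
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

Because scan only returns what matched, no reject(&:empty?) step is needed. One thing to be aware of: an all-caps run directly followed by a capitalized word is consumed greedily, so scan_words("HTMLParser") gives ["htmlp", "arser"].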
Upvotes: 4