Josh Hadik

Reputation: 433

How can I use regex in Ruby to split a string into an array of the words it contains?

I am trying to create a regex pattern that will split a string into an array of words based on many different patterns and conventions. The rules are as follows:

  1. It must split the string on all dashes, spaces, underscores, and periods.
  2. When several of these characters appear together, it must split only once (so 'the--.quick' must split to ['the', 'quick'] and not ['the', '', '', 'quick']).
  3. It must split the string on new capital letters, while keeping that letter with its corresponding word ('theQuickBrown' splits to ['the', 'quick', 'brown']).
  4. It must group multiple uppercase letters in a row together ('LETS_GO' must split to ['lets', 'go'], not ['l', 'e', 't', 's', 'g', 'o'])
  5. It must use only lowercase letters in the split array.

If it is working properly, the following should be true:

"theQuick--brown_fox JumpsOver___the.lazy  DOG".split_words == 
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

So far, I have been able to get almost there. The only remaining issue is that it splits on every capital, so "DOG".split_words is ["d", "o", "g"] and not ["dog"].

I also use a combination of regex and maps/filters on the split array to get to the solution. Bonus points if you can tell me how to get rid of that and use only regex.

Here's what I have so far:

class String
  def split_words 
    split(/[_,\-, ,.]|(?=[A-Z]+)/).
    map(&:downcase).
    reject(&:empty?)
  end 
end

Which when called on the string from the test above returns:

["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "d", "o", "g"]

How can I update this method to meet all of the above specs?

Upvotes: 2

Views: 699

Answers (3)

max pleaner

Reputation: 26778

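You can change the regex slightly so that it no longer splits before every capital, only before a run of capitals that is followed by lowercase letters. This just involves putting [a-z]+ after the [A-Z]+ in the lookahead. Note that the result below still needs map(&:downcase) to satisfy rule 5.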

string = "theQuick--brown_fox JumpsOver___the.lazy  DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/
string.split(regex).reject(&:empty?)
# => ["the", "Quick", "brown", "fox", "Jumps", "Over", "the", "lazy", "DOG"]
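To get the lowercase array the question asks for, the same split can be wrapped in a small helper with a downcase step. This is only a sketch: split_words is written here as a standalone method rather than a String patch, SPLIT_REGEX is a name made up for illustration, and the character class is written as [-_ .] because the commas in the original class match literal commas rather than acting as separators.

```ruby
# Hypothetical helper, not part of the answer above.
# Splits on any of - _ space . or before a run of capitals that is
# followed by lowercase letters, then lowercases each piece.
SPLIT_REGEX = /[-_ .]|(?=[A-Z]+[a-z]+)/

def split_words(string)
  string
    .split(SPLIT_REGEX)
    .reject(&:empty?)    # consecutive separators leave empty strings behind
    .map(&:downcase)
end

p split_words("theQuick--brown_fox JumpsOver___the.lazy  DOG")
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
p split_words("LETS_GO")
# => ["lets", "go"]
```

The reject step is still needed with this approach, since each separator character splits on its own.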

Upvotes: 5

Cary Swoveland

Reputation: 110725

r = /
    [- _.]+      # match one or more combinations of dashes, spaces,
                 # underscores and periods
    |            # or
    (?<=\p{Ll})  # match a lower case letter in a positive lookbehind
    (?=\p{Lu})   # match an upper case letter in a positive lookahead
    /x           # free-spacing regex definition mode

str = "theQuick--brown_dog, JumpsOver___the.--lazy   FOX for $5"

str.split(r).map(&:downcase)
  #=> ["the", "quick", "brown", "dog,", "jumps", "over", "the", "lazy",
  #    "fox", "for", "$5"]

If the string is to be broken on spaces and all punctuation characters, replace [- _.]+ with [ [:punct:]]+. Search for "[[:punct:]]" in the Regexp documentation for the reference.
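As a rough sketch of that variant, the regex can be written on one line with the wider separator class. The method name split_on_punct and the sample string are assumptions made for illustration. The sample deliberately contains no symbols such as $, since whether those count as [[:punct:]] may depend on the Ruby version and is worth checking before relying on it.

```ruby
# Hypothetical helper: separators are runs of spaces and punctuation,
# plus the zero-width boundary between a lowercase and an uppercase letter.
PUNCT_SPLIT = /[ [:punct:]]+|(?<=\p{Ll})(?=\p{Lu})/

def split_on_punct(str)
  str.split(PUNCT_SPLIT).map(&:downcase)
end

p split_on_punct("theQuick--brown_dog, JumpsOver___the.--lazy   FOX")
# => ["the", "quick", "brown", "dog", "jumps", "over", "the", "lazy", "fox"]
```

With this class the comma after "dog" is treated as a separator, so it no longer sticks to the word.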

Upvotes: 2

Wiktor Stribiżew

Reputation: 627082

You may use a matching approach instead of splitting: extract chunks of two or more uppercase letters, or any single letter followed by zero or more lowercase letters:

s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)

See the Ruby demo and the Rubular demo.

The regex matches:

  • \p{Lu}{2,} - 2 or more uppercase letters
  • | - or
  • \p{L} - any letter
  • \p{Ll}* - 0 or more lowercase letters.

With map(&:downcase), the items you get with .scan() are turned to lower case.
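For completeness, here is a minimal sketch of the scan approach checked against the examples from the question. It uses a standalone split_words method and a WORD_REGEX constant, both names chosen for illustration rather than taken from the question's String patch.

```ruby
# Hypothetical wrapper around the scan-based regex above.
WORD_REGEX = /\p{Lu}{2,}|\p{L}\p{Ll}*/

def split_words(str)
  # scan returns only the matched chunks, so separators never
  # produce empty strings and no reject step is needed.
  str.scan(WORD_REGEX).map(&:downcase)
end

p split_words("theQuick--brown_fox JumpsOver___the.lazy  DOG")
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
p split_words("LETS_GO")
# => ["lets", "go"]
p split_words("the--.quick")
# => ["the", "quick"]
```

Because the pattern matches letters only, anything else (digits, punctuation, symbols) is simply skipped rather than kept as part of a word.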

Upvotes: 4
