Reputation: 23
How do I convert a String: 'Hello world!'
to an array: ['Hello', ' ', 'world!']
with all spaces preserved?
I tried to convert the string using the split method with different parameters, but I didn't find the right solution.
Also I didn't find any other method in the documentation (Class: String (Ruby 3.1.0)) suitable for solving this problem.
Upvotes: 2
Views: 841
Reputation:
You can continue to use split and still preserve spaces by using a simple regex with a capture group:
"Hello World ! ".split(/( +)/)
#=> ["Hello", " ", "World", " ", "!", " "]
The only catch I'm aware of is that strings starting with a space will result in an array that starts with an empty string:
" Hello World ! ".split(/( +)/)
#=> ["", " ", "Hello", " ", "World", " ", "!", " "]
If this is a problem, you can add something like drop_while to the mix:
" Hello World ! ".split(/( +)/).drop_while(&:empty?)
#=> [" ", "Hello", " ", "World", " ", "!", " "]
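As a rough sketch, the two steps above could be wrapped in a small helper. The name split_keep_spaces is made up for illustration and is not part of Ruby:

```ruby
# Hypothetical helper (the name is an assumption, not a Ruby built-in).
# Splits on runs of spaces, keeps each run via the capture group,
# and drops the empty string produced by leading spaces.
def split_keep_spaces(str)
  str.split(/( +)/).drop_while(&:empty?)
end

plain   = split_keep_spaces('Hello world!')
#=> ["Hello", " ", "world!"]
leading = split_keep_spaces(' Hello World ! ')
#=> [" ", "Hello", " ", "World", " ", "!", " "]
```

Note that this only treats literal spaces as separators; tabs and newlines stay attached to the neighbouring word.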
Upvotes: 1
Reputation: 84343
String#split normally discards the delimiter, so on its own it won't preserve your spaces. You can use String#scan or String#partition instead. Using Unicode character properties, you can scan for matches with:
'Hello world!'.scan /[\p{Alnum}\p{Punct}]+|\p{Space}+/
#=> ["Hello", " ", "world!"]
You can also use POSIX character classes (called "bracket expressions" in Ruby) to do the same thing if you prefer. For example:
'Hello world!'.scan /[[:alnum:][:punct:]]+|[[:space:]]+/
#=> ["Hello", " ", "world!"]
Either of these options will be more robust than solutions that rely on ASCII-only characters or literal whitespace atoms, but if you know your strings won't include other types of characters or encodings then those solutions will work too.
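To illustrate the point about non-ASCII text, here is a small sketch with a German sample string. The sample string and the variable names are only assumptions for the demo:

```ruby
sample = 'Grüße, Welt!'

# Bracket expressions are Unicode-aware, so ü and ß count as letters.
unicode_tokens = sample.scan(/[[:alnum:][:punct:]]+|[[:space:]]+/)
#=> ["Grüße,", " ", "Welt!"]

# An ASCII-only pattern silently skips the characters it doesn't know.
ascii_tokens = sample.scan(/[A-Za-z,!]+| +/)
#=> ["Gr", "e,", " ", "Welt!"]
```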
If you're looking for brevity in your regular expression, and you're sure you won't need to concern yourself with Unicode characters or explicitly differentiating between non-whitespace characters and punctuation, you can also use the \s and \S metacharacters. For example:
'Hello world!'.scan /\s+|\S+/
#=> ["Hello", " ", "world!"]
This is generally less robust than the character properties or bracket expressions above, but is still unambiguous, short, and easy to read. It fits your example, so it's worth mentioning, but the \S metacharacter can match control characters and other unexpected things, so you need to be cautious with it unless you really know your data. For example, your string might contain an invisible NUL or a control character like CTRL-D, in which case \S would catch it and return a Unicode-escaped character:
"\x00".scan /\S+/
#=> ["\u0000"]
?\C-D.scan /\S+/
#=> ["\u0004"]
This is probably not what you'd expect, but given a larger data set this type of thing inevitably happens. The more explicit you can be, the fewer problems you're likely to have with your production data.
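One way to notice this kind of surprise early is a round-trip check: join the tokens and compare with the input. This is only a sketch; the helper name lossless? and the sample string are my own assumptions, not anything built into Ruby:

```ruby
# Hypothetical helper: true if scanning with `pattern` keeps every character.
# Assumes `pattern` has no capture groups, so scan returns plain strings.
def lossless?(str, pattern)
  str.scan(pattern).join == str
end

explicit  = /[[:alnum:][:punct:]]+|[[:space:]]+/
catch_all = /\s+|\S+/
dirty     = "Hello\x00world!"

kept    = lossless?(dirty, catch_all) #=> true  (the NUL stays inside a token)
dropped = lossless?(dirty, explicit)  #=> false (the NUL matches no class and is skipped)
```

So the explicit pattern doesn't return the NUL, but it also drops it without telling you; the check makes that visible either way.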
For the very simple use case in your original example, you only have two words separated by whitespace. That means you can also use String#partition to partition on the sequential whitespace. That will split the string into exactly three elements, preserving the whitespace that partitions the words. For example:
'Hello world!'.partition /\s+/
#=> ["Hello", " ", "world!"]
While simpler, the partitioning approach won't work as well with longer strings such as:
'Goodbye cruel world!'.partition /\s+/
#=> ["Goodbye", " ", "cruel world!"]
so String#scan is going to be a better and more flexible approach for the general use case. However, anytime you want to split a string into three elements, or to preserve the partitioning element itself, #partition can be very handy.
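For comparison, a short sketch of how the two methods behave at the edges (the variable names are just for the demo). When the pattern doesn't match at all, #partition still returns three elements, with the last two empty:

```ruby
# No whitespace in the string: the separator and the tail come back empty.
no_space = 'Hello'.partition(/\s+/)
#=> ["Hello", "", ""]

# scan keeps going past the first separator.
all_words = 'Goodbye cruel world!'.scan(/\s+|\S+/)
#=> ["Goodbye", " ", "cruel", " ", "world!"]
```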
Upvotes: 3
Reputation: 22225
It just occurred to me that you could use scan. Assuming that your string is stored in the variable s, and you want to separate space regions and non-space regions, you could do:
s.scan(/[ ]+|[^ ]+/)
which in your case would yield
["Hello", " ", "world!"]
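One thing to keep in mind, shown here as a small sketch with a made-up sample string: [ ] only matches the literal space character, so a tab counts as part of a word, whereas \s would treat it as whitespace:

```ruby
t = "Hello\tbig world!"

space_only = t.scan(/[ ]+|[^ ]+/)
#=> ["Hello\tbig", " ", "world!"]

any_whitespace = t.scan(/\s+|\S+/)
#=> ["Hello", "\t", "big", " ", "world!"]
```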
Upvotes: 4