Reputation: 15985
Case 1(Trailing space)
> "on behalf of all of us ".split(/\W+/)
=> ["on", "behalf", "of", "all", "of", "us"]
But if there is a leading space, it gives the following:
Case 2 (Leading space)
> " on behalf of all of us".split(/\W+/)
=> ["", "on", "behalf", "of", "all", "of", "us"]
I was expecting the same result as Case 1 for Case 2 as well.
ADDED
> "@dhh congratulations!!".split(/\W+/)
=> ["", "dhh", "congratulations"]
Could anyone please help me understand this behavior?
Upvotes: 3
Views: 900
Reputation: 15985
Just for the record, the following works for me:
" @dhh congratulations!!".gsub(/^\W+/,'').split /\W+/
Another option:
" @dhh congratulations!!".scan /\w+/
Both give the expected results. However, there is a caveat for contractions like
> " Don't be shy.".scan /\w+/
=> ["Don", "t", "be", "shy"]
I am actually collecting words that are not articles, conjunctions, prepositions, etc., so I am ignoring such contractions anyway, which is why I used this solution.
I am preparing a word cloud from tweets. If you know of any proven algorithm, please share.
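For what it's worth, here is a minimal sketch of how the counting step for such a word cloud might look. It is not a proven algorithm, just an illustration: the STOP_WORDS list and the word_counts helper are hypothetical names made up for this example, and the regex is one possible way to keep contractions like "don't" together instead of splitting them.

```ruby
# Hypothetical stop list for illustration; a real one would be much longer.
STOP_WORDS = %w[a an the and or but of on in to for be].freeze

# Count word frequencies across tweets, skipping stop words.
def word_counts(tweets)
  counts = Hash.new(0)
  tweets.each do |tweet|
    # \w+(?:'\w+)* keeps contractions such as "don't" as a single token.
    tweet.downcase.scan(/\w+(?:'\w+)*/).each do |word|
      counts[word] += 1 unless STOP_WORDS.include?(word)
    end
  end
  counts
end

tweets = [" @dhh congratulations!!", " Don't be shy.", "congratulations to all of us"]
counts = word_counts(tweets)
```

The resulting hash maps each remaining word to its frequency, which could then be used to size the words in the cloud.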
Upvotes: 0
Reputation: 80065
From the docs:
split(pattern=$;, [limit]) → anArray
[...] If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
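A small sketch of what the quoted passage implies, using shortened versions of the strings from the question (the variable names are just for illustration): only trailing empty fields are suppressed, so a leading separator still produces an empty first element, and a negative limit makes the trailing one visible too.

```ruby
# A leading separator yields an empty first field, which split keeps.
leading = " on behalf".split(/\W+/)

# A trailing separator yields an empty last field, which is dropped
# when the limit is omitted.
trailing = "on behalf ".split(/\W+/)

# A negative limit keeps the trailing empty field.
trailing_kept = "on behalf ".split(/\W+/, -1)

p leading        # ["", "on", "behalf"]
p trailing       # ["on", "behalf"]
p trailing_kept  # ["on", "behalf", ""]
```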
Upvotes: 1
Reputation: 34145
[Update]
Skip the regex and just split on whitespace!
> "@dhh congratulations!!".split
=> ["@dhh", "congratulations"]
\W matches any non-word character, including a space. With a leading space, the separator matches at the very start of the string, so split returns an empty string as the first field. A trailing space produces an empty field too, after the last word, but split suppresses trailing empty fields when no limit is given.
To get consistent behavior for whitespace, remove it first with the #strip method.
Case 1 (Trailing space)
> "on behalf of all of us ".strip.split(/\W+/)
=> ["on", "behalf", "of", "all", "of", "us"]
Case 2 (Leading space)
> " on behalf of all of us".strip.split(/\W+/)
=> ["on", "behalf", "of", "all", "of", "us"]
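One caveat worth sketching: #strip only removes whitespace, so it does not help with the "@dhh" example from the question, where the leading separator is a non-word character. Assuming the goal is simply to drop empty fields, one possible approach is to reject them after splitting (variable names here are illustrative):

```ruby
text = "@dhh congratulations!!"

# strip leaves "@" in place, so the empty first field is still there.
with_strip = text.strip.split(/\W+/)

# Rejecting empty strings removes it regardless of what the separator was.
without_empty = text.split(/\W+/).reject(&:empty?)

p with_strip     # ["", "dhh", "congratulations"]
p without_empty  # ["dhh", "congratulations"]
```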
Upvotes: 4