Amit Patel

Reputation: 15985

Strange behavior while splitting string with non-word character regex

Case 1 (Trailing space)

> "on behalf of all of us  ".split(/\W+/)
 => ["on", "behalf", "of", "all", "of", "us"] 

But if there is a leading space, it gives the following:

Case 2 (Leading space)

> "  on behalf of all of us".split(/\W+/)
 => ["", "on", "behalf", "of", "all", "of", "us"] 

I was expecting Case 2 to give the same result as Case 1.

ADDED

> "@dhh congratulations!!".split(/\W+/)
 => ["", "dhh", "congratulations"] 

Could anyone please help me understand this behavior?

Upvotes: 3

Views: 900

Answers (3)

Amit Patel

Reputation: 15985

For the record, the following works for me:

 " @dhh congratulations!!".gsub(/^\W+/,'').split /\W+/

Another option:

 " @dhh congratulations!!".scan /\w+/

Both give the expected results. However, there is a caveat for contractions such as:

 > " Don't be shy.".scan /\w+/
 => ["Don", "t", "be", "shy"]  

I am actually collecting words that are not articles, conjunctions, prepositions, etc. I am ignoring such short forms anyway, so this solution works for me.
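If contractions ever need to stay intact, one possible variation is to let the pattern accept an apostrophe between runs of word characters. This is only a sketch; the method name words_with_contractions is hypothetical, and it handles the plain ASCII apostrophe only.

```ruby
# Hypothetical helper, not part of the original answer.
# Matches a run of word characters, optionally followed by
# one or more "'word" groups, so "Don't" stays in one piece.
def words_with_contractions(text)
  text.scan(/\w+(?:'\w+)*/)
end

p words_with_contractions(" Don't be shy.")
# => ["Don't", "be", "shy"]
p words_with_contractions("@dhh congratulations!!")
# => ["dhh", "congratulations"]
```

Because the group is non-capturing, scan returns whole matches rather than arrays of captures.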

I am preparing a word cloud from tweets. If you know of a proven algorithm, please share.

Upvotes: 0

steenslag

Reputation: 80065

From the docs:

split(pattern=$;, [limit]) → anArray

[...] If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
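The effect of the limit parameter can be seen with a small sketch. The method names split_default and split_keep_trailing are hypothetical wrappers added here for illustration; the only library call is String#split.

```ruby
# Hypothetical wrappers around String#split, for illustration only.

# No limit: trailing empty fields are suppressed.
def split_default(text)
  text.split(/\W+/)
end

# Negative limit: trailing empty fields are kept.
def split_keep_trailing(text)
  text.split(/\W+/, -1)
end

p split_default("on behalf of all of us  ")
# => ["on", "behalf", "of", "all", "of", "us"]
p split_keep_trailing("on behalf of all of us  ")
# => ["on", "behalf", "of", "all", "of", "us", ""]
p split_default("  on behalf of all of us")
# => ["", "on", "behalf", "of", "all", "of", "us"]
```

Only trailing empty fields are suppressed. The leading empty field in the last call is always kept, which is the asymmetry seen in the question.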

Upvotes: 1

CuriousMind

Reputation: 34145

[Update]

Skip the regex and just split on whitespace. Note that this keeps punctuation attached to the words:

> "@dhh congratulations!!".split
 => ["@dhh", "congratulations!!"] 

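\W matches any non-word character, including a space; a word character is anything in [a-zA-Z0-9_]. When the string begins with a separator, split returns an empty string as the first field, because nothing precedes that separator. When the separator is at the end, the empty field after it is a trailing empty field, and split drops those when no limit is given.

To get consistent behavior for leading and trailing whitespace, remove it first with the #strip method. This helps only with whitespace: a leading non-word character such as @ still produces an empty first element.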

Case 1 (Trailing space)

> "on behalf of all of us  ".strip.split(/\W+/)
 => ["on", "behalf", "of", "all", "of", "us"] 

Case 2 (Leading space)

> "  on behalf of all of us".strip.split(/\W+/)
 => ["on", "behalf", "of", "all", "of", "us"]
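Since strip removes only whitespace, input such as "@dhh congratulations!!" still starts with a separator after stripping. One way to cover that case as well is to drop empty fields after splitting. This is a minimal sketch; the method name words is hypothetical.

```ruby
# Hypothetical helper, for illustration only.
# Splits on runs of non-word characters and discards any empty
# fields, including the one produced by a leading separator.
def words(text)
  text.split(/\W+/).reject(&:empty?)
end

p "@dhh congratulations!!".strip.split(/\W+/)
# => ["", "dhh", "congratulations"]
p words("@dhh congratulations!!")
# => ["dhh", "congratulations"]
p words("  on behalf of all of us")
# => ["on", "behalf", "of", "all", "of", "us"]
```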

Upvotes: 4
