Reputation: 2890
First, sorry for my English and the confusing description in the title.
My problem is that I have multiple lines of natural-language phrases and I want to count the words in them. I came up with the following regex in Perl:
my @words = split /[ :,.;\s\/\t!"\n]+/, $_;
It works fine, except that when it encounters a word like 'U.S.A' it breaks the word into U, S and A, which is undesired. What can I do to fix it? Thanks.
Upvotes: 0
Views: 63
Reputation: 35208
I'd split on whitespace, then remove any non-word characters from the beginning and end of each "word". That way U.S.A. would end up as U.S.A:
use strict;
use warnings;
local $_ = 'hello world, U.S.A., and other places.';
my @words = map { s/^\W+|\W+$//g; $_ } split /\s+/, $_;
use Data::Dump;
dd \@words;
Outputs:
["hello", "world", "U.S.A", "and", "other", "places"]
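Since the goal in the question is a word count over several lines, here is a minimal sketch of how the same idea might be extended. The helper name words_in_line and the sample lines are made up for illustration. It assumes that a token made only of punctuation (such as "--") should not count as a word, so empty strings are filtered out after trimming, and it uses split ' ' so that leading whitespace does not produce an empty first field.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into words, keeping inner punctuation
# (the dots in "U.S.A") but trimming punctuation at either end of a token.
sub words_in_line {
    my ($line) = @_;
    return grep { length }                                 # drop tokens that were pure punctuation
           map  { (my $w = $_) =~ s/^\W+|\W+$//g; $w }     # trim non-word chars on a copy
           split ' ', $line;                               # whitespace split, skips leading blanks
}

# Sample input, invented for this example.
my @lines = (
    'hello world, U.S.A., and other places.',
    '  wait -- what?  ',
);

my $total = 0;
for my $line (@lines) {
    my @words = words_in_line($line);
    $total += @words;                                      # array in numeric context = element count
}

print "$total\n";    # prints 8
```

Note that, unlike the regex in the question, this does not split on "/", so something like "and/or" would count as a single word. If that matters, splitting on m{[\s/]+} instead of ' ' is one option, at the cost of handling a possible empty leading field (which the grep above already removes).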
Upvotes: 1