Rory A Campbell

Reputation: 131

How do I scan for combinations of words in ruby using regex?

I'm trying to scan a string for any combination of a list of words. Specifically, I want to find any 'number word' combinations such as "two hundred and eighty" or "fifty eight".

To do this I have made a list of all the single number words up to a million:

numberWords = ["one", "two", "three", ...... "hundred", "thousand", "million"]

I then joined the list together using "|" and made a regex like this:

string.scan(/\b(#{wordList}(\s|\.|,|\?|\!))+/)

I expected this to return a list of all number word combinations but it only returns the words separately. For example, if there is "three million" in the string it returns "three" and "million" but not "three million". How do I correct this?

Upvotes: 2

Views: 559

Answers (3)

FUJI Goro

Reputation: 889

I have ported Perl's Regexp::Trie to Ruby:

This is a simple version of Regexp::Assemble, but good enough for me.
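To illustrate the idea behind a trie-based assembler, here is a minimal sketch. It is not the API of the port mentioned above; `trie_source` and `emit_node` are hypothetical names made up for this example. The words are folded into a nested hash and then emitted as nested non-capturing groups:

```ruby
# Hypothetical helper, not part of any library: walks a trie node and
# returns the regex source that matches every suffix stored under it.
def emit_node(node)
  terminal = node.key?(:end) # does a word end at this node?
  branches = node.reject { |key, _| key == :end }.map do |char, child|
    Regexp.escape(char) + emit_node(child)
  end
  return "" if branches.empty?

  # A single non-terminal branch needs no grouping; everything else does.
  body = branches.length == 1 && !terminal ? branches.first : "(?:#{branches.join('|')})"
  terminal ? "#{body}?" : body # optional tail when a shorter word ends here
end

# Hypothetical helper: builds the trie as nested hashes, then emits it.
def trie_source(words)
  trie = {}
  words.each do |word|
    node = trie
    word.each_char { |char| node = (node[char] ||= {}) }
    node[:end] = true
  end
  emit_node(trie)
end

words = %w[seven seventeen seventy six sixteen sixty]
source = trie_source(words)
# => "s(?:even(?:t(?:een|y))?|ix(?:t(?:een|y))?)"
pattern = /\A#{source}\z/
```

Because the optional tails are greedy, the longer word is tried before the shorter one that shares its prefix.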

Upvotes: 1

the Tin Man

Reputation: 160551

Just for fun, here's a slightly more interesting way to generate patterns that have to match long lists:

#!/usr/bin/env perl

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
foreach (@ARGV) {
    $ra->add($_);
}
print $ra->re, "\n";

Save that as "regexp_assemble.pl", install Perl's Regexp::Assemble module, then run:

perl ./regexp_assemble.pl one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty thirty forty fifty sixty seventy eighty ninety hundred thousand million ' ' '\.' ',' '?' '!'

You should see this generated:

(?^:(?:[ !,.?]|t(?:h(?:irt(?:een|y)|ousand|ree)|w(?:e(?:lve|nty)|o)|en)|f(?:o(?:ur(?:teen)?|rty)|i(?:ft(?:een|y)|ve))|s(?:even(?:t(?:een|y))?|ix(?:t(?:een|y))?)|e(?:ight(?:een|y)?|leven)|nine(?:t(?:een|y))?|hundred|million|one))

That's Perl's version of the pattern, and it needs a few minor tweaks to meet your requirements: remove the leading ?^: and its surrounding parentheses, add a trailing +, and, for flexibility, make it case-insensitive:

pattern = /(?:[ !,.?]|t(?:h(?:irt(?:een|y)|ousand|ree)|w(?:e(?:lve|nty)|o)|en)|f(?:o(?:ur(?:teen)?|rty)|i(?:ft(?:een|y)|ve))|s(?:even(?:t(?:een|y))?|ix(?:t(?:een|y))?)|e(?:ight(?:een|y)?|leven)|nine(?:t(?:een|y))?|hundred|million|one)+/i

Here are some scan results:

'one dollar'.scan(pattern) # => ["one "]
'one million dollars'.scan(pattern) # => ["one million "]
'one million three hundred dollars'.scan(pattern) # => ["one million three hundred "]
'one million, three hundred!'.scan(pattern) # => ["one million, three hundred!"]
'one million, three hundred and one dollars'.scan(pattern) # => ["one million, three hundred ", " one "]

Unfortunately, Ruby doesn't have an equivalent to Perl's Regexp::Assemble module. Such a module is quite useful for this sort of task, as the regular expression engine in Ruby is very fast.

The only downside is that the pattern captures leading and trailing spaces, but that's easily fixed by calling map(&:strip) on the results:

'one million, three hundred and one dollars'.scan(pattern).map(&:strip) # => ["one million, three hundred", "one"]
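A comparable pattern can also be assembled in plain Ruby without the Perl step. The following is a minimal sketch, assuming the same word list as the command above; `NUMBER_WORDS`, `PATTERN`, and `number_phrases` are names made up for this example. It relies on `Regexp.union`, sorts longer words first so that "seventeen" is tried before "seven", and interpolates `.source` so the outer `/i` flag still applies to the words:

```ruby
NUMBER_WORDS = %w[
  one two three four five six seven eight nine ten eleven twelve thirteen
  fourteen fifteen sixteen seventeen eighteen nineteen twenty thirty forty
  fifty sixty seventy eighty ninety hundred thousand million
].freeze

# Longest words first: Ruby's alternation takes the first alternative that
# matches, so "seven" must not shadow "seventeen".
alternatives = Regexp.union(NUMBER_WORDS.sort_by { |word| -word.length })

# .source drops the (?-mix:...) wrapper that interpolating a Regexp object
# would add, so the /i flag below covers the words as well.
PATTERN = /(?:[ !,.?]|#{alternatives.source})+/i

# Hypothetical helper: returns the number phrases found in text, stripped
# of surrounding whitespace, skipping matches that were only spaces.
def number_phrases(text)
  text.scan(PATTERN).map(&:strip).reject(&:empty?)
end

number_phrases('one million, three hundred and one dollars')
# => ["one million, three hundred", "one"]
```

The extra `reject(&:empty?)` is there because the pattern also matches a lone space between two ordinary words, which becomes an empty string after `strip`.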

Upvotes: 2

sawa

Reputation: 168101

numberWords = ["one", "two", "three", "hundred", "thousand", "million"]
numberWords = Regexp.union(numberWords)
# => /one|two|three|hundred|thousand|million/

"foo bar three million dollars"
.scan(/\b#{numberWords}(?:(?:\s+and\s+|\s+)#{numberWords})*\b/)
# => ["three million"]
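As for why the original attempt returned the words separately, two things seem to be at play: the interpolated `one|two|...` is not grouped, so the trailing separator group binds only to the last alternative, and `scan` returns the capture groups rather than the whole match when the pattern contains any. A small sketch of the difference (the names `word_list`, `broken`, and `grouped` are illustrative only):

```ruby
words = %w[one two three hundred thousand million]
word_list = words.join("|")

# Ungrouped alternation plus capturing groups, as in the question.
broken = /\b(#{word_list}(\s|\.|,|\?|\!))+/
"three million dollars".scan(broken)
# => [["three", nil], ["million ", " "]]

# Grouping the alternation and making every group non-capturing lets scan
# return each whole match as a single string.
grouped = /\b(?:(?:#{word_list})[\s.,?!]*)+/
"three million dollars".scan(grouped).map(&:strip)
# => ["three million"]
```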

Upvotes: 7
