Andy
Andy

Reputation: 1139

Ruby Regular expression too big / Multiple string match

I have 1,000,000 strings that I want to categorize. The way I do this is to bucket it if it contains a set of words or phrases. The set of words is about 10,000. Ideally I would be able to support regular expressions, but I am focused on making it run fast right now. Example phrases:

ford, porsche, mazda...

I really dont want to match each word against the strings one by one, so I decided to use regular expressions. Unfortunately, I am running into a regular expression issue:

Regexp.new("(a)"*253) => /(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)...

Regexp.new("(a)"*254) RegexpError: regular expression too big: /(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)...

where a would be one of my words or phrases. Right now, I am planning on running 10,000 / 253 matches. I read that the length of the regex heavily impacts performance, but my regex match is really simple and the regexp is created very quickly. I would like to get around the limitation somehow, or use a better solution if anyone has any ideas. Thanks.

Upvotes: 2

Views: 720

Answers (1)

walrii
walrii

Reputation: 3522

You might consider other mechanisms for recognizing 10k words.

  • Trie: Sometimes called a prefix tree, it is often used by spell checkers for doing word lookups. See Trie on wikipedia
  • DFA (deterministic finite automata): A DFA is often created by the lexer in a compiler for recognizing the tokens of the language. A DFA runs very quickly. Simple regexes are often compiled into DFAs. See DFA on wikipedia

Upvotes: 1

Related Questions