waigani

Reputation: 3580

Hadoop Pig Latin regex match

I have the following Pig Latin filter:

filtered = FILTER raw BY year >= 1960 AND string MATCHES '(?!.*[0-9].*|.{1}|.*@.*|.*www.*|.*http.*)';

I was intending to get the following results for the following strings:

a #false .{1}
[email protected] #false .*@.*
http://somesite.com #false .*http.*
www.somesite.com #false .*www.*
12word #false .*[0-9].*
wo12rd #false .*[0-9].*
word12 #false .*[0-9].*
red #true

Instead, I get an empty result set.

EDIT: I've updated the regex to:

'^(?!.*[0-9].*|.{1}|.*@.*|.*www.*|.*http.*)$'

after m.buettner's correction, but continue to get an empty result set.

Upvotes: 1

Views: 5083

Answers (1)

Martin Ender

Reputation: 44259

There are two problems. Firstly, it seems that Pig Latin requires the pattern to match the full string instead of "just a match somewhere within the string". But your negative lookahead does not consume any characters, so it cannot match the full string. This can be resolved simply by appending .* after the lookahead. Secondly, your rule .{1} (where {1} is redundant) does not require this one character to be the only character in the string. So in your last example, it will simply match the r of red, and the negative lookahead will reject the string. Anchoring that alternative as .$ restricts it to single-character strings.
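Both problems can be reproduced outside Pig. The following is a minimal sketch in plain Java, assuming that Pig's MATCHES behaves like a full-string match with Java's regex engine; the class and method names (LookaheadDemo, fullMatch) are made up for illustration.

```java
import java.util.regex.Pattern;

class LookaheadDemo {
    // The pattern from the question: a bare negative lookahead that consumes nothing.
    static final String ORIGINAL = "(?!.*[0-9].*|.{1}|.*@.*|.*www.*|.*http.*)";

    // Full-string match (assumed to mirror what Pig's MATCHES does).
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Problem 1: a zero-width pattern can only fully match the empty string.
        System.out.println(fullMatch(ORIGINAL, ""));         // true
        System.out.println(fullMatch(ORIGINAL, "red"));      // false

        // Problem 2: an unanchored "." inside the lookahead rejects any non-empty string,
        // while ".$" rejects only strings that are exactly one character long.
        System.out.println(fullMatch("(?!.{1}).*", "red"));  // false
        System.out.println(fullMatch("(?!.$).*", "red"));    // true
        System.out.println(fullMatch("(?!.$).*", "a"));      // false
    }
}
```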

Thus, here is the solution:

(?!.*[0-9]|.$|.*@|.*www|.*http).*
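As a quick sanity check, the pattern can be run against the strings from the question. This is a sketch in plain Java rather than Pig, again assuming MATCHES is a full-string match with Java regex semantics. The class name PigFilterRegexCheck and the address user@example.com are placeholders, since the original e-mail example is obscured in the question.

```java
class PigFilterRegexCheck {
    // The corrected pattern from the answer.
    static final String FIXED = "(?!.*[0-9]|.$|.*@|.*www|.*http).*";

    // String.matches requires the whole input to match the pattern.
    static boolean accepts(String s) {
        return s.matches(FIXED);
    }

    public static void main(String[] args) {
        String[] samples = {
            "a",                    // single character
            "user@example.com",     // hypothetical address, contains @
            "http://somesite.com",  // contains http
            "www.somesite.com",     // contains www
            "12word", "wo12rd", "word12",  // contain digits
            "red"                   // the only sample that should pass
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Only red should print true. In the Pig script the same pattern would go inside the single-quoted string after MATCHES.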

Upvotes: 1
