SubSevn
Reputation: 1028

Hadoop/Pig regular expression matching

This is an odd situation, but I'm looking for a way to filter with something like MATCHES against a list of patterns that isn't known in advance (and whose length is also unknown).

That is, suppose the input is two files, one with numbers (A):

xxxx

yyyy

zzzz

zzyy

...etc...

And the other with patterns (B):

xx.*

yyy.*

...etc...

How can I filter the first input by all of the patterns in the second?

If I knew all the patterns beforehand, I could write A = FILTER A BY (num MATCHES 'somepattern.*' OR num MATCHES 'someotherpattern' ...);

The problem is that I don't know them beforehand, and since they're patterns and not simple strings, I cannot just use joins/groups (at least as far as I can tell). Maybe a strange nested FOREACH...thing? Any ideas at all?

Upvotes: 1

Views: 3854

Answers (1)

QuinnG
Reputation: 6424

The | character acts as an OR in a regular expression, so you can combine the individual patterns into a single pattern:

(xx.*|yyy.*|zzzz.*)

A value matches this combined pattern if it matches any one of the individual patterns.
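As a quick check of the combined pattern, here is a minimal Python sketch run against the sample values from the question. It is only an illustration outside of Pig: `re.fullmatch` is used because it requires the whole string to match, which is how Java's `String.matches` behaves, and Pig's MATCHES follows Java regex rules. The variable names are made up for the example.

```python
import re

# Sample values from the question's file A.
numbers = ["xxxx", "yyyy", "zzzz", "zzyy"]

# The combined pattern shown above: one alternation of all patterns.
combined = re.compile(r"(xx.*|yyy.*|zzzz.*)")

# Keep a value only if the entire string matches one of the alternatives.
kept = [n for n in numbers if combined.fullmatch(n)]
print(kept)
```

Here zzyy is dropped because none of the three alternatives matches it in full.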

Edit: To create the combined regex pattern:
* Create a string starting with (
* Read in each line (assuming each line is a pattern) and append it to the string, followed by a |
* When done reading lines, remove the last character (which will be an unneeded |)
* Append a )

This will create a regex pattern to check all the patterns in the input file. (Note: It's assumed the file contains valid patterns)
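The steps above can be sketched in Python as a small pre-processing step. This is an illustration under assumptions, not a Pig feature: the function name `build_combined_pattern` is hypothetical, and it uses `"|".join` instead of appending a | after each pattern and trimming the last one, which gives the same result without the trailing-character cleanup. Blank lines are skipped, since an empty alternative would match the empty string.

```python
def build_combined_pattern(lines):
    """Join one-pattern-per-line input into a single alternation."""
    # Strip newlines and surrounding whitespace; ignore blank lines.
    patterns = [line.strip() for line in lines if line.strip()]
    if not patterns:
        raise ValueError("no patterns supplied")
    # Equivalent to "(" + p1 + "|" + p2 + "|" ... with the last | removed, then ")".
    return "(" + "|".join(patterns) + ")"


# Lines as they might be read from the patterns file B (hypothetical content).
pattern_lines = ["xx.*\n", "yyy.*\n", "\n"]
combined = build_combined_pattern(pattern_lines)
print(combined)
```

One way to use the result, assuming parameter substitution fits your setup, is to pass the string to the Pig script with -param and reference it inside the MATCHES expression.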

Upvotes: 3
