user2931635

Reputation: 37

Regex within a for each loop in pig

I am currently using a UDF to produce this output, but a regular expression would do the same job and would probably be quicker.

I am having a problem running the code in Pig. This is the line I am trying to run:

data = FOREACH f GENERATE FLATTEN(REGEX EXTRACT(col4,'(?:\.)([^\.]*\.?[^\.]*)$')) AS (url:chararray) ;

This line fails with the error: Syntax error, unexpected symbol at or near '('

The regex is meant to extract the TLD: for google.co.uk it should return .co.uk, and for google.com it should return .com. Link here: http://gskinner.com/RegExr/?372tm

My idea is then to count how many times each TLD occurs, e.g. 3 for co.uk:

 countURL = group data by url;
 result = foreach countURL generate group, COUNT($1);

If anyone can help that would be great.

Thanks

Upvotes: 0

Views: 413

Answers (1)

Frederic

Reputation: 3284

A couple of things:

  • You are missing the _ in REGEX_EXTRACT
  • You need to specify the group index as the third argument (0 here, which is the whole match)
  • The backslashes before the dots need to be doubled (\\.)

data = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(url,'(?:\\.)([^\\.]*\\.?[^\\.]*)$', 0));

This gives .com for google.com
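
If you want to check what the pattern captures before running it in Pig, here is a minimal Python sketch. It assumes Python's re module treats this particular pattern the same way as the Java regex engine behind Pig; the helper name extract and the sample hostnames are made up for illustration. It shows that group 0 keeps the leading dot while group 1 drops it, and then counts the results the way the GROUP/COUNT step in the question would.

```python
import re
from collections import Counter

# Same pattern as in the answer. A raw string is used, so each escaped dot
# needs only one backslash here (the Pig string literal needs two).
PATTERN = re.compile(r'(?:\.)([^\.]*\.?[^\.]*)$')


def extract(host, group):
    """Return the requested group of the first match, or None if there is no match."""
    match = PATTERN.search(host)
    return match.group(group) if match else None


# Hypothetical sample data, not taken from the question.
hosts = ["google.co.uk", "google.com", "www.bbc.co.uk", "example.com", "localhost"]

for host in hosts:
    # Group 0 is the whole match (with the dot), group 1 is the capture (without it).
    print(host, extract(host, 0), extract(host, 1))

# Count hosts per TLD, skipping hosts where the pattern did not match.
tld_counts = Counter(
    tld for tld in (extract(host, 1) for host in hosts) if tld is not None
)
print(tld_counts)
```

So if you want co.uk rather than .co.uk as the grouping key, asking for group 1 instead of group 0 should be enough.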

Upvotes: 1
