Reputation: 37
I am currently using a UDF to get this output, but a regular expression should do the same job and will probably be quicker.
I am having a problem running the code in Pig. This is the line I am trying to run:
data = FOREACH f GENERATE FLATTEN(REGEX EXTRACT(col4,'(?:\.)([^\.]*\.?[^\.]*)$')) AS (url:chararray) ;
This line fails with the error: Syntax error, unexpected symbol at or near '('
The regex is meant to return .co.uk for google.co.uk and .com for google.com. Link here: http://gskinner.com/RegExr/?372tm
My idea is then to count the number of occurrences of each TLD, e.g. 3 co.uk
countURL = group data by url;
result = foreach countURL generate group, COUNT($1);
If anyone can help that would be great.
Thanks
Upvotes: 0
Views: 413
Reputation: 3284
A couple of things:
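The function name needs an underscore (_): it is regex_extract, not "regex extract".
It takes a third argument, the index of the group to return. Use 0 here.
Backslashes in the pattern have to be doubled (\\) inside the Pig string.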
data = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(url,'(?:\\.)([^\\.]*\\.?[^\\.]*)$', 0));
This gives .com for google.com
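If you want to sanity-check the pattern outside Pig, here is a rough sketch in Python. It assumes Python's re module behaves like the Java regex engine for this particular pattern (it only uses syntax common to both). The helper name extract_suffix and the sample host names are made up for illustration, and the Counter stands in for the GROUP / COUNT step from the question.

```python
import re
from collections import Counter

# Same pattern as in the Pig statement. A Python raw string is used, so the
# backslashes are single here rather than doubled.
TLD_PATTERN = re.compile(r'(?:\.)([^\.]*\.?[^\.]*)$')


def extract_suffix(host, index=0):
    """Return the requested group of the first match, or None if nothing matches.

    index 0 is the whole match (leading dot included),
    index 1 is the capturing group (no leading dot).
    """
    match = TLD_PATTERN.search(host)
    return match.group(index) if match else None


# Made-up sample data, standing in for the url column.
hosts = ["google.co.uk", "google.com", "bbc.co.uk", "example.com", "localhost"]

# Extract a suffix per host, then count occurrences of each suffix,
# skipping hosts where the pattern did not match.
suffixes = [extract_suffix(h) for h in hosts]
counts = Counter(s for s in suffixes if s is not None)

for suffix, n in counts.items():
    print(n, suffix)
```

Two things worth noting from this: index 1 returns the suffix without the leading dot (co.uk rather than .co.uk), and because the match is taken from the first dot that works, a host like www.google.com comes back as .google.com rather than .com.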
Upvotes: 1