Ajeet Ganga

Reputation: 8653

Extract multiple regex matches from same line in PIG

This is what I want to do

INPUT 

    1,code=1a_asdfasdf_code=1b,asdf
    2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf
    3,code=3a_,sdoliclwmd

Intermediate 

    1,{1a,1b}
    2,{2a,2b,2c}
    3,{3a}


Finally

    1,1a
    1,1b
    2,2a
    2,2b
    2,2c
    3,3a

I know of REGEX_EXTRACT and REGEX_EXTRACT_ALL, but neither of them gives multiple matches for the same regex.

This is giving me only the first match

    A = LOAD '/data/regsearch1.csv' using PigStorage(',') as (c1:chararray,c2:chararray,c3:chararray);
    B = foreach A generate c1, REGEX_EXTRACT_ALL(c2,'.*code=([^_]+)_.*') as m1;
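
A likely reason only one value comes back: as far as I understand it, REGEX_EXTRACT_ALL applies the pattern to the whole string once and returns its capture groups, so a pattern with a single group yields a single value. The Python sketch below imitates that with `re.fullmatch` and contrasts it with `re.findall`. It is an analogy, not Pig's implementation, and `sample` is simply the middle field of the second input line.

```python
import re

# Middle field of the second sample line from the question.
sample = "code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf"

# Whole-string match with one capture group: exactly one captured value,
# no matter how many times "code=" occurs in the string.
whole = re.fullmatch(r".*code=([^_]+)_.*", sample)
single = whole.groups() if whole else ()

# Repeated search: one captured value per occurrence of the pattern.
every = re.findall(r"code=([^_]+)", sample)

print(single)
print(every)
```

The whole-string form can only ever fill its one group once, which is why a repeated search (or a UDF that performs one) is needed to collect every code.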

Upvotes: 2

Views: 1850

Answers (3)

Ashish

Reputation: 11

This can be achieved by simple string manipulation.

    A = LOAD 'Data.txt' Using PigStorage(',') AS (a1:int,a2:chararray,a3:chararray);
    B = foreach A generate a1, REPLACE(a2,'asdfasdf_','') AS a2;
    C = FOREACH B GENERATE a1, FLATTEN(TOKENIZE(a2, '_')) AS parameter;
    D = FILTER C BY INDEXOF(parameter, 'code=') != -1;
    E = FOREACH D GENERATE a1, SUBSTRING(parameter, 5, 7) AS number;
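
For checking the logic outside Pig, here is a rough Python rendering of the same token-based idea. It is a sketch under a few assumptions: each line has exactly three comma-separated fields, and everything after `code=` in a token is the code (rather than a fixed two-character SUBSTRING). `codes_by_tokens` is a hypothetical helper name, not part of any library.

```python
def codes_by_tokens(line):
    """Return (id, code) pairs for one comma-separated input line."""
    # Assumes three fields: id, the field holding the codes, and a trailer.
    record_id, field, _rest = line.split(",", 2)
    pairs = []
    for token in field.split("_"):        # plays the role of TOKENIZE(a2, '_')
        pos = token.find("code=")         # plays the role of INDEXOF(parameter, 'code=')
        if pos != -1:
            # Keep everything after "code=" instead of a fixed-width slice.
            pairs.append((record_id, token[pos + len("code="):]))
    return pairs


print(codes_by_tokens("2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf"))
```

Splitting on `_` and filtering makes the REPLACE of the literal `asdfasdf_` filler unnecessary in this version, since tokens without `code=` are simply skipped.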

Upvotes: 0

Ajeet Ganga

Reputation: 8653

Just FYI, this question was about Pig Latin.

I ended up writing a Python UDF:

    #!/usr/bin/python
    import re

    # Jython UDF: return every match of `pattern` in `text` as a bag of 1-tuples.
    @outputSchema("bag1:bag{tuple1:tuple(match:chararray)}")
    def findallregex(pattern, text):
        outbag = []
        # With one capture group, re.findall returns just the captured strings.
        for m in re.findall(pattern, text):
            outbag.append((m,))
        return outbag

And then this Pig Latin code:

    REGISTER '/findall.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs;
    A = LOAD '/regsearch1.csv' using PigStorage(',') as (c1:chararray,c2:chararray,c3:chararray);
    B = foreach A generate c1, myfuncs.findallregex('code=([^_]+)',c2) as bag1;
    C = foreach B generate c1, flatten(bag1);
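
To sanity-check the UDF logic without a Pig cluster, the same steps can be run in plain Python. This sketch leaves out the `@outputSchema` decorator (Pig's Jython engine supplies it) and emulates `flatten` with a nested loop. `find_all` and `flatten_rows` are hypothetical names, and the three-field split is an assumption based on the sample input.

```python
import re


def find_all(pattern, text):
    """Same idea as the UDF: one 1-tuple per regex match."""
    return [(m,) for m in re.findall(pattern, text)]


def flatten_rows(lines):
    """Emulate LOAD + FOREACH ... flatten(bag1) on comma-separated lines."""
    rows = []
    for line in lines:
        # Assumes exactly three comma-separated fields, as in the sample data.
        c1, c2, _c3 = line.split(",", 2)
        for (match,) in find_all(r"code=([^_]+)", c2):
            rows.append((c1, match))
    return rows


lines = [
    "1,code=1a_asdfasdf_code=1b,asdf",
    "2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf",
    "3,code=3a_,sdoliclwmd",
]
for row in flatten_rows(lines):
    print(",".join(row))
```

On the sample input this produces one `id,code` row per match, which is the final shape the question asks for.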

Upvotes: 3

Murillo Maia

Reputation: 45

You have to use capture groups. I don't know if you need to process a lot of data, but you can pull the first digit of each line and run the pattern over the string.

input

    1,code=1a_asdfasdf_code=1b,asdf
    2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf
    3,code=3a_,sdoliclwmd

output

    1,1a
    1,1b
    2,2a
    2,2b
    2,2c
    3,3a

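    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Prints "<id>,<code>" for every code=XX occurrence in one input line.
    private static void lineProcess(String text) {
        // The capture group grabs the two word characters after "code=".
        Pattern p = Pattern.compile("code=(\\w\\w)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        while (m.find()) {
            // The first character of the line is taken as the record ID.
            System.out.println(text.substring(0, 1) + "," + m.group(1));
        }
    }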

Upvotes: 0
