Reputation: 8653
This is what I want to do
INPUT
1,code=1a_asdfasdf_code=1b,asdf
2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf
3,code=3a_,sdoliclwmd
Intermediate
1,{1a,1b}
2,{2a,2b,2c}
3,{3a}
Finally
1,1a
1,1b
2,2a
2,2b
2,2c
3,3a
I know of REGEX_EXTRACT and REGEX_EXTRACT_ALL, but neither of them returns multiple matches for the same regex.
This gives me only the first match:
A = LOAD '/data/regsearch1.csv' using PigStorage(',') as (c1:chararray,c2:chararray,c3:chararray);
B = foreach A generate c1,REGEX_EXTRACT_ALL(c2,'.*code=([^_]+)_.*') as m1;
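One possible reason for the single result, sketched here in plain Python rather than Pig: the pattern starts and ends with .*, so it has to cover the whole field, and a whole-string match can succeed only once per row. The sketch assumes REGEX_EXTRACT_ALL behaves roughly like Python's re.fullmatch, which is an assumption about Pig and not something shown in the question; the variable names are illustrative.

```python
import re

# Second column of row 2 from the sample input.
field = "code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf"

# Whole-string match: the pattern has one capture group, so at most
# one value comes back no matter how many "code=" occurrences exist.
whole = re.fullmatch(r".*code=([^_]+)_.*", field)
single_capture = whole.groups()

# Scanning match: the un-anchored pattern is applied repeatedly,
# yielding one capture per occurrence.
all_captures = re.findall(r"code=([^_]+)", field)

print(len(single_capture), all_captures)
```

Under that assumption, getting every code needs a function that scans the field repeatedly instead of matching it once.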
Upvotes: 2
Views: 1850
Reputation: 11
This can be achieved by simple string manipulation.
A = LOAD 'Data.txt' Using PigStorage(',') AS (a1:int,a2:chararray,a3:chararray);
B = foreach A generate a1, REPLACE(a2,'asdfasdf_','') AS a2;
C = FOREACH B GENERATE a1, FLATTEN(TOKENIZE(a2, '_')) AS parameter;
D = FILTER C BY INDEXOF(parameter, 'code=') != -1;
E = FOREACH D GENERATE a1, SUBSTRING(parameter, 5, 7) AS number;
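For comparison, the same split-and-filter idea can be sketched in plain Python. This is only an illustration, not Pig code: codes_by_splitting is a hypothetical helper, and instead of the fixed SUBSTRING(parameter, 5, 7) it strips the code= prefix, so it does not assume that every code is exactly two characters long.

```python
def codes_by_splitting(field, prefix="code="):
    """Split the field on '_' and keep the value of each 'code=' token."""
    tokens = field.split("_")
    # Keep only tokens that start with the prefix, then drop the prefix.
    return [t[len(prefix):] for t in tokens if t.startswith(prefix)]


# Second column of each sample row from the question.
print(codes_by_splitting("code=1a_asdfasdf_code=1b"))
print(codes_by_splitting("code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf"))
print(codes_by_splitting("code=3a_"))
```

Because tokens without the prefix are filtered out, the filler token does not need to be removed first in this version.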
Upvotes: 0
Reputation: 8653
Just FYI, this question was about Pig Latin.
I ended up writing a Python UDF:
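#!/usr/bin/python
import re

# outputSchema is provided by Pig's Jython script engine; it declares the schema of the returned bag.
@outputSchema("bag1:bag{tuple1:tuple(match:chararray)}")
def findallregex(pattern, text):
    # Collect every match of the pattern as a one-field tuple.
    outbag = []
    matches = re.findall(pattern, text)
    for m in matches:
        tuple1 = (m,)
        outbag.append(tuple1)
    return outbag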
And then this Pig Latin code:
REGISTER '/findall.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs;
A = LOAD '/regsearch1.csv' using PigStorage(',') as (c1:chararray,c2:chararray,c3:chararray);
B = foreach A generate c1, myfuncs.findallregex('code=([^_]+)',c2) as bag1;
C = foreach B generate c1, flatten(bag1);
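To check the pattern outside Pig, the UDF's logic and the final flatten step can be imitated in plain Python. This is a rough sketch: find_all_codes, flatten_rows and SAMPLE are hypothetical names, and the rows hold the first two columns of the sample input from the question.

```python
import re


def find_all_codes(pattern, text):
    """Return every capture of pattern in text as a list of 1-tuples."""
    return [(m,) for m in re.findall(pattern, text)]


def flatten_rows(rows, pattern="code=([^_]+)"):
    """Emit one (id, code) pair per match, like FLATTEN on the bag."""
    out = []
    for row_id, field in rows:
        for (code,) in find_all_codes(pattern, field):
            out.append((row_id, code))
    return out


# (c1, c2) pairs taken from the sample input.
SAMPLE = [
    ("1", "code=1a_asdfasdf_code=1b"),
    ("2", "code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf"),
    ("3", "code=3a_"),
]

for row_id, code in flatten_rows(SAMPLE):
    print(row_id + "," + code)
```

Note that the pattern has no trailing underscore, so a code at the very end of the field (1b in row 1) is still captured.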
Upvotes: 3
Reputation: 45
You have to use capture groups. I don't know whether you need to process a lot of data, but you can take the first character of each line as the ID and apply the pattern to the whole line.
input
1,code=1a_asdfasdf_code=1b,asdf
2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf
3,code=3a_,sdoliclwmd
output
1,1a
1,1b
2,2a
2,2b
2,2c
3,3a
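import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static void lineProcess(String text) {
    // Capture the two word characters that follow each "code=".
    Pattern p = Pattern.compile("code=(\\w\\w)", Pattern.DOTALL);
    Matcher m = p.matcher(text);
    // find() advances through the line, so every occurrence is printed.
    while (m.find()) {
        // The first character of the line is used as the row ID.
        System.out.println(text.substring(0, 1) + "," + m.group(1));
    }
}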
Upvotes: 0