johnothegrouch

Reputation: 83

Hadoop Pig: Extract all substrings matching a given regular expression

I am parsing some data of the form:

(['L123', 'L234', 'L1', 'L253764'])
(['L23', 'L2'])
(['L5'])
...

where everything inside the parentheses, brackets included, is stored as a single chararray.

I want to extract just the L+(digits) tags to obtain tuples of the form:

((L123, L234, L1, L253764))
((L23, L2))
((L5))

I have tried REGEX_EXTRACT_ALL with the regular expression '(L\d+)', but it seems to extract only a single tag per line, which is not what I need. Is there a way to build tuples like the ones described above?

Upvotes: 3

Views: 5035

Answers (1)

mr2ert

Reputation: 5186

If the order of the tags does not matter (the result is a bag, which is unordered), then this will work:

-- foo is the tuple, and bar is the name of the chararray
-- Split the chararray on commas into a bag of single-field tuples
B = FOREACH A GENERATE TOKENIZE(foo.bar, ',') AS values: {T: (value: chararray)};
C = FOREACH B {
    -- Keep only the L<digits> part of each token
    clean_values = FOREACH values GENERATE
                   REGEX_EXTRACT(value, '(L[0-9]+)', 1) AS clean_value: chararray;
    GENERATE clean_values;
}

The schema and output are:

C: {clean_values: {T: (clean_value: chararray)}}
({(L123),(L234),(L1),(L253764)})
({(L23),(L2)})
({(L5)})

Generally, if you don't know in advance how many elements the array will have, a bag is a better fit than a tuple.
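
If it helps to see the per-row logic outside of Pig, here is a rough Python sketch of the same two steps: split on commas, then pull the L<digits> part out of each piece. This is only an illustration, not Pig code. The function name extract_tags and the sample rows are made up for the example, and unlike REGEX_EXTRACT (which returns null for a token with no match) the sketch simply skips such tokens.

```python
import re

# Same pattern as in the Pig snippet; the capture group is not needed here.
TAG = re.compile(r"L[0-9]+")

def extract_tags(raw):
    """Split raw on commas and keep the first L<digits> match of each piece.

    Hypothetical helper for illustration only; it mirrors TOKENIZE followed
    by REGEX_EXTRACT, except that pieces without a match are skipped.
    """
    tags = []
    for token in raw.split(","):
        match = TAG.search(token)
        if match:
            tags.append(match.group(0))
    return tags

# Sample rows copied from the question.
rows = [
    "['L123', 'L234', 'L1', 'L253764']",
    "['L23', 'L2']",
    "['L5']",
]

for row in rows:
    print(extract_tags(row))
```

A Python list keeps the input order, whereas a Pig bag makes no ordering guarantee, so the ordering caveat from the start of the answer still applies to the Pig version.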

Upvotes: 2
