Reputation: 83
I am parsing some data of the form:
(['L123', 'L234', 'L1', 'L253764'])
(['L23', 'L2'])
(['L5'])
...
where the phrases inside the parens, including the brackets, are encoded as a single chararray.
I want to extract just the L+(digits) tags to obtain tuples of the form:
((L123, L234, L1, L253764))
((L23, L2))
((L5))
I have tried REGEX_EXTRACT_ALL with the regular expression '(L\d+)', but it seems to extract only a single tag per line, which is not what I need. Is there a way to create tuples in the way I have described above?
Upvotes: 3
Views: 5035
Reputation: 5186
If order does not matter, then this will work:
-- foo is the tuple, and bar is the name of the chararray
B = FOREACH A GENERATE TOKENIZE(foo.bar, ',') AS values: {T: (value: chararray)} ;
C = FOREACH B {
    clean_values = FOREACH values GENERATE
        REGEX_EXTRACT(value, '(L[0-9]+)', 1) AS clean_value: chararray ;
    GENERATE clean_values ;
}
The schema and output are:
C: {clean_values: {T: (clean_value: chararray)}}
({(L123),(L234),(L1),(L253764)})
({(L23),(L2)})
({(L5)})
Generally, if you don't know how many elements the array will have, a bag is a better fit than a tuple, because a bag can hold any number of elements.
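If a tuple is still required, as the question asks, one option is to convert the bag afterwards. This is only a minimal sketch: it assumes a Pig version that ships the built-in BagToTuple (0.11 or later, as far as I know), and the relation name D and the field name tags are made up for illustration. Since a bag is unordered, the order of the tags inside the resulting tuple is not guaranteed either.

```pig
-- D and tags are hypothetical names; C is the relation produced above.
-- BagToTuple flattens the bag of single-field tuples into one tuple.
D = FOREACH C GENERATE BagToTuple(clean_values) AS tags ;
DUMP D ;
```

Each row of D should then hold one tuple containing all the tags from the corresponding input line, rather than a bag of one-element tuples.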
Upvotes: 2