Reputation: 101
I have the data below.
• PRT_Edit & Set Shopping Cart in Retail
• PRT_Confirm Shopping Cart for Goods
o PRT-Ret_Process Supplier Invoice
o PRT-Web_Overview of Orders
o PRT_Update Outfirst Agreement
PRT_Axn_-Purchase and Requisition
The data contains special symbols, tabs, and spaces. I want to extract only the text part of each line, like this:
PRT_Edit & Set Shopping Cart in Retail
PRT_Confirm Shopping Cart for Goods
PRT-Ret_Process Supplier Invoice
PRT-Web_Overview of Orders
PRT_Update Outfirst Agreement
I have tried using REGEX_EXTRACT_ALL in a Pig script as shown below, but it does not work.
PRT = LOAD '/DATA' USING TEXTLOADER() AS (LINE:CHARARRAY);
Cleansed = FOREACH PRT GENERATE REGEX_EXTRACT_ALL(LINE,'[A-Z]*') AS DATA;
When I try dumping Cleansed, it does not show any data. Can anyone please help?
Upvotes: 2
Views: 379
Reputation: 626748
You can use
Cleansed = FOREACH PRT GENERATE FLATTEN(
REGEX_EXTRACT_ALL(LINE, '^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$'))
AS (FIELD1:chararray), LINE;
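As a quick way to check the pattern outside Pig, here is a minimal Python sketch. It rests on two assumptions: that Python's `re` module behaves like the Java regex engine for this particular expression, and that REGEX_EXTRACT_ALL only returns groups when the pattern matches the whole line (emulated here with `fullmatch`). The helper name `extract_text` and the exact whitespace in the sample lines are hypothetical.

```python
import re

# Pattern taken from the Pig statement above.
PATTERN = re.compile(r'^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$')


def extract_text(line):
    """Return the span from the first to the last Latin letter, or None."""
    # fullmatch stands in for whole-line matching (an assumption about Pig).
    m = PATTERN.fullmatch(line)
    return m.group(1) if m else None


# Sample lines adapted from the question; leading characters are guesses.
bullet_line = "\u2022 PRT_Edit & Set Shopping Cart in Retail"
letter_bullet_line = "\to PRT-Ret_Process Supplier Invoice  "

print(repr(extract_text(bullet_line)))
print(repr(extract_text(letter_bullet_line)))

# The pattern from the question cannot cover a whole line, so nothing matches.
print(re.fullmatch(r'[A-Z]*', bullet_line))
```

Under these assumptions, the question's `[A-Z]*` fails because it cannot span the bullet, spaces, and lowercase letters of a full line. Note also that if the sub-bullet in the data is a literal letter `o` rather than a symbol, it counts as a Latin letter and stays in the captured text.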
The regex matches the following:
^ - start of string
[^a-zA-Z]* - 0 or more characters other than Latin letters
([a-zA-Z].*[a-zA-Z]) - a capturing group that we'll refer to as FIELD1 later, matching a Latin letter, then any characters, as many as possible (the greedy * is used, not the lazy *?), up to the last Latin letter
[^a-zA-Z]* - 0 or more characters other than Latin letters
$ - end of string
Upvotes: 1