Reputation: 4208
Say I want select a subset of data by their values of from the same field. Right now I have to do something like this
TestLocationsResults = FILTER SalesData by (StoreId =='17'
or StoreId =='85'
or StoreId =='12'
or StoreId =='45'
or StoreId =='26'
or StoreId =='75'
or StoreId =='13'
)
in SQL, we can simply do this :
SELECT * FROM SalesData where StoreID IN (17, 12, 85, 45, 26, 75, 13)
Is there a similiar shortcut in Pig that I am missing?
Upvotes: 1
Views: 1878
Reputation: 925
It looks like Pig 0.12 added an IN
operator.
So you can do
FILTER SalesData BY StoreID IN (17, 12, 85, 45, 26, 75, 13);
Upvotes: 3
Reputation: 39009
The way you're currently doing it is the best way to do it in Pig. All of the alternatives to what you're doing now are either hacky, slow, or both. Hopefully, Pig adds an "in" query in a future version, but for now you're doing it the best way available.
Upvotes: 1
Reputation: 3619
One work around could be to use a built in function "INDEXOF" Ex:
TestLocationsResults = FILTER SalesData by INDEXOF(',17,12,85,45,26,75,13,', CONCAT(CONCAT(',', StoreId), ',')) > -1;
Amended to take into account the comment, introduce the ',' symbols around StoreId to have the exact match and not a partial
Upvotes: 1
Reputation: 5811
My solution to this, when the data type is chararray
, is to use a regular expression:
TestLocationsResults = FILTER SalesData by StoreID MATCHES '(17|12|85|45|26|75|13)';
When the data type is an int
, you could try casting to a chararray
.
Upvotes: 1
Reputation: 30089
There is no IN keyword in Pig to do this sort of set membership detection.
One suggestion if to write a UDF (as seen in this question / answer).
Another could be to create a relationship with values for each StoreId
you want to filter by and then perform an inner join on the two relationships.
Upvotes: 1