Reputation: 10159
I have two data sets (1M unique string) and (1B unique string); I want to know how many strings are common in both sets, and wondering what is the most efficient way to get the number using Apache Pig?
Upvotes: 3
Views: 963
Reputation: 11
If your data is already sorted in both the data sets you can define merged join
.
Mergede = join A by a1, B by b1 USING "merge";
Skewed Join: If the data is skewed and user need finer control over the allocation to reducers.
skewedh = join A by a1, B by b1 USING "skewed";
Upvotes: 1
Reputation: 3261
You can first join both the file like below:
A = LOAD '/joindata1.txt' AS (a1:int,a2:int,a3:int);
B = LOAD '/joindata2.txt' AS (b1:int,b2:int);
X = JOIN A BY a1, B BY b1;
Then you can count the number of rows :
grouped_records = GROUP X ALL;
count_records = FOREACH grouped_records GENERATE COUNT(A.a1);
Does it help you problem...
Upvotes: 5
Reputation: 2221
Your case doesn't fall under either replicate or merge or skewed join. So you have to do a default join, where in map phase it annotates each record's source, Join key would be used as the shuffle key so that the same join key goes to same reducer then the leftmost input is cached in memory in the reducer side and the other input is passed through to do a join. You could also improve your join by normal join optimizations like filter NULL's before joining and table which has the largest number of tuples per key could be kept as the last table in your query.
Upvotes: 1