Rajiv

Reputation: 1

Joins in Pig with pattern match

I have two files, file1 and file2.

The contents of file1 are:

aa
bb
cc

The contents of file2 are:

aab f2 f3
zzx f2 f3
bbc f2 f3

I would like to join file1 (on field 1) and file2 (on field 1) using Pig where the output is:

aa aab f2 f3
bb bbc f2 f3

Basically, the match must be something similar to aa*, bb*, cc* etc.

Any ideas on how to go about it?

Upvotes: 0

Views: 1166

Answers (1)

reo katoa

Reputation: 5801

The simplest solution would be to use the CROSS operator followed by a FILTER.

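-- file1 holds a single field per row; file2 holds three.
input1 = LOAD 'file1' AS (f1:chararray);
input2 = LOAD 'file2' AS (f1:chararray, f2, f3);

-- Pair every row of input1 with every row of input2.
crossed = CROSS input1, input2;
-- Keep only the pairs where input2's f1 starts with input1's f1.
filtered = FILTER crossed BY INDEXOF(input2::f1, input1::f1) == 0;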

INDEXOF is a built-in UDF that searches for the second string within the first one and returns the index of the first occurrence, or -1 if it does not occur. Here the first argument is the value from file2 and the second is the value from file1. Since you want the file2 value to begin with the file1 value, you are looking for an index of 0.
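
To see what the filter is testing, it can help to emit the index as its own field before filtering. This is only a sketch: it assumes file1 holds one field per row and that both files use the default tab delimiter. The names indexed, short_key, long_val and idx are made up for illustration.

```pig
input1 = LOAD 'file1' AS (f1:chararray);
input2 = LOAD 'file2' AS (f1:chararray, f2, f3);

crossed = CROSS input1, input2;

-- Expose the INDEXOF result next to the two strings being compared.
indexed = FOREACH crossed GENERATE
    input1::f1 AS short_key,
    input2::f1 AS long_val,
    INDEXOF(input2::f1, input1::f1) AS idx;

-- With the sample data, the pair (aa, aab) gets idx 0
-- and the pair (aa, zzx) gets idx -1.
DUMP indexed;
```

Only the rows with idx equal to 0 survive the FILTER in the answer.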

See the "Cross" section of the "Advanced Pig Latin" chapter of the excellent book Programming Pig. In particular, note the warning about CROSS generating large amounts of data: it produces one output row for every pair of input rows. If you have large inputs, you may wish to formulate an application-specific way of constructing a join key so that you do not require fuzzy matching.
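
As one possible way to build such a join key, here is a minimal sketch. It assumes that every key in file1 is exactly two characters long, as in the sample data (aa, bb, cc). If your real keys vary in length, this will not work as written. The names keyed, join_key, joined and result are illustrative only.

```pig
-- Assumption: every key in file1 is exactly two characters long.
input1 = LOAD 'file1' AS (f1:chararray);
input2 = LOAD 'file2' AS (f1:chararray, f2, f3);

-- Derive an exact-match key from the first two characters of file2's f1.
keyed = FOREACH input2 GENERATE SUBSTRING(f1, 0, 2) AS join_key, f1, f2, f3;

-- An ordinary equi-join now takes the place of CROSS followed by FILTER.
joined = JOIN input1 BY f1, keyed BY join_key;

-- Drop the helper key so that only the original fields remain.
result = FOREACH joined GENERATE input1::f1, keyed::f1, keyed::f2, keyed::f3;
```

Because this is a regular join on equality, Pig does not have to materialize every pairing of rows from the two inputs.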

Upvotes: 2
