drio

Reputation: 51

Expressing a grep-like algorithm in MapReduce terms for a very long list of keywords

I am having trouble expressing an algorithm in MapReduce terms.

I have two big input text files; let's call the first file "R" and the second one "P". R is typically much bigger than P, but both are big.

In a non-MapReduce approach, the contents of P would be loaded into memory (hashed) and then we would iterate over all the lines in R. The lines in R are just strings, and we want to check whether any substring of a line in R matches any string in P.
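
To make this concrete, here is a minimal sketch of that non-MapReduce approach in Java, assuming the strings in P are matched against whitespace-delimited tokens of each line in R; the file names are just placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class NaiveGrep {
    public static void main(String[] args) throws IOException {
        // Load every string of P into an in-memory hash set.
        Set<String> keywords = new HashSet<String>();
        try (BufferedReader p = new BufferedReader(new FileReader("P.txt"))) {
            String line;
            while ((line = p.readLine()) != null) {
                keywords.add(line.trim());
            }
        }

        // Stream over R and print every line containing one of the strings.
        try (BufferedReader r = new BufferedReader(new FileReader("R.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                for (String token : line.split("\\s+")) {
                    if (keywords.contains(token)) {
                        System.out.println(line);
                        break;
                    }
                }
            }
        }
    }
}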

The problem is very similar to grepping words in a big file; the issue is that the list of words is so large that you cannot hardcode them in your map routine.

The problem I am encountering is that I don't know how to ensure that every split of the P file is paired with a map job for each split of the R file. So, assuming these splits:

R = R1, R2, R3;
P = P1, P2

The 6 map jobs have to contain these splits:

(R1, P1) (R1, P2);
(R2, P1) (R2, P2);
(R3, P1) (R3, P2);

How would you express this problem in mapreduce terms?

Thanks.

Upvotes: 1

Views: 831

Answers (2)

drio

Reputation: 51

I have spent some time working on this and have come up with a couple of solutions. The first one is based on Hadoop Streaming and the second one uses native Java.

For the first solution I use an interesting feature of Ruby: if you add the keyword __END__ at the end of your code, all the text after it is exposed by the interpreter via the global constant DATA, which is a File object. Example:

$ cat /tmp/foo.rb
puts DATA.read

__END__
Hello World!
$ ruby /tmp/foo.rb
Hello World!

We use the file R as input (it will be distributed across the HDFS filesystem). We iterate over the P file and, after traversing a certain number of lines, append those lines to the end of our mapper script. Then we submit the job to the Hadoop cluster. We keep iterating over the contents of P until we have consumed all the lines. Multiple jobs will be sent to the cluster, depending on the number of lines per job and the size of P.

That's a fine approach that I have implemented and it works quite well. I don't find it particularly elegant, though. We can do better by writing a native MapReduce app in Java.

When using a native Java app, we have full access to the Hadoop HDFS API. That means we can read the contents of a file from our code, something I don't think is available when streaming.

We follow an approach similar to the streaming method, but once we have traversed a certain number of lines of P, we send those to the Hadoop cluster instead of appending them to the code. We can do that within the code that schedules our jobs.

Then it is a matter of running as many jobs as there are splits of P. All the mappers in a particular job will load a particular split of P and use it to process the splits of R.
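
As a rough sketch of what such a mapper could look like (not the exact code I wrote; the p.split.path configuration key, the class name and the whitespace tokenization are placeholders you would adapt):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Set<String> keywords = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Read this job's split of P straight from HDFS into memory.
        Configuration conf = context.getConfiguration();
        Path pSplit = new Path(conf.get("p.split.path"));
        FileSystem fs = pSplit.getFileSystem(conf);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(pSplit)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                keywords.add(line.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit any line of R that contains one of the strings in this split of P.
        for (String token : value.toString().split("\\s+")) {
            if (keywords.contains(token)) {
                context.write(value, NullWritable.get());
                break;
            }
        }
    }
}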

Upvotes: 1

Praveen Sripati

Reputation: 33495

Nice problem.

One quick way I can think of is to split the P file into multiple files and run multiple MR jobs with each split of the P file and the complete R file as input.
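
A minimal sketch of that driver loop, assuming map-only jobs and a hypothetical GrepMapper that loads its split of P in setup(); the paths, the p.split.path key and the class names are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GrepDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One MR job per split of P; every job scans the complete R file.
        int jobIndex = 0;
        for (FileStatus pSplit : fs.listStatus(new Path("/input/P-splits"))) {
            Configuration jobConf = new Configuration(conf);
            jobConf.set("p.split.path", pSplit.getPath().toString());

            Job job = Job.getInstance(jobConf, "grep-" + jobIndex);
            job.setJarByClass(GrepDriver.class);
            job.setMapperClass(GrepMapper.class); // hypothetical mapper that loads its split of P in setup()
            job.setNumReduceTasks(0);             // map-only job
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            FileInputFormat.addInputPath(job, new Path("/input/R"));
            FileOutputFormat.setOutputPath(job, new Path("/output/grep-" + jobIndex));

            job.waitForCompletion(true);
            jobIndex++;
        }
    }
}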

Upvotes: 0
