ajduff574

Reputation: 2111

Do something to the entire Reducer values list based on one element

I have an interesting problem that I'm struggling to fit in MapReduce. I have a bunch of log entries. What I need to do is something like this:

Check if any entry for a given IP has a specific flag set. If it does, apply a transform to all entries with that IP, otherwise do not transform.

The simplest way to do this would be to key off of IP, then in the reducer iterate once over the values to check if any have that flag set, and again to transform (if necessary). Unfortunately, it seems I can only iterate over the Iterable passed into the reducer once.

The possible solutions I see are:

  1. In the reducer, serialize the values I'm reading to disk so that I can lazily deserialize later to iterate a second time. This seems like a bit of a hack.
  2. Run some job beforehand that generates a list of IPs to transform, and store this in HBase or something. This obviously requires HBase, and a lot of network communication.

I'd like to stick with standard MapReduce, to be able to easily run on Amazon Elastic MapReduce. I feel like there should be some way of doing this via chained jobs, but I can't seem to come up with anything. Does anyone have any tips on how I could do this?

Upvotes: 0

Views: 163

Answers (1)

bajafresh4life

Reputation: 12853

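One possibility: have your mappers output a compound key that includes both the IP address and whether this specific flag is set. Partition and group on the IP alone, so that all records for an IP still reach the same reduce call, but sort on the full compound key so that the records where flag=true appear first within each IP group. The reducer then only has to look at the first record of a group: if its flag is set, apply the transformation to every record in that group as you stream through it; if not, no record in the group has the flag and you pass them through unchanged. Either way, one pass over the Iterable is enough.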

Here is a blog posting that describes how to do this:

http://www.riccomini.name/Topics/DistributedComputing/Hadoop/SortByValue/
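
To make the idea concrete, here is a minimal sketch in plain Java that simulates the sort-then-single-pass logic without any Hadoop classes. The `Entry` class, the `transform` method (which just upper-cases the payload), and the sample IPs are all made up for illustration. In a real job the sorting would be done by the shuffle using your compound key's comparator, together with a partitioner and grouping comparator that look at the IP only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class FlagFirstSketch {

    // Hypothetical log entry: the (ip, flagged) pair plays the role of the compound key.
    static final class Entry {
        final String ip;
        final boolean flagged;
        final String payload;

        Entry(String ip, boolean flagged, String payload) {
            this.ip = ip;
            this.flagged = flagged;
            this.payload = payload;
        }
    }

    // Stand-in for the shuffle sort: order by IP, then flagged=true before flagged=false.
    static final Comparator<Entry> SHUFFLE_ORDER = (a, b) -> {
        int byIp = a.ip.compareTo(b.ip);
        if (byIp != 0) {
            return byIp;
        }
        return Boolean.compare(b.flagged, a.flagged);
    };

    // Placeholder transform; the real one depends on your data.
    static String transform(String payload) {
        return payload.toUpperCase();
    }

    // Sorts the entries, then makes a single pass, deciding per IP group
    // from the first record alone whether to transform the whole group.
    static List<String> run(List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(SHUFFLE_ORDER);

        List<String> out = new ArrayList<>();
        String currentIp = null;
        boolean transformGroup = false;
        for (Entry e : sorted) {
            if (!e.ip.equals(currentIp)) {
                // First record of a new IP group: if any record in the group
                // is flagged, the sort order guarantees this one is.
                currentIp = e.ip;
                transformGroup = e.flagged;
            }
            String value = transformGroup ? transform(e.payload) : e.payload;
            out.add(e.ip + "\t" + value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
                new Entry("10.0.0.1", false, "a"),
                new Entry("10.0.0.2", false, "b"),
                new Entry("10.0.0.1", true, "c"),
                new Entry("10.0.0.2", false, "d"));
        run(entries).forEach(System.out::println);
    }
}
```

In this sample, both entries for 10.0.0.1 get transformed because one of them is flagged, while the entries for 10.0.0.2 pass through unchanged. Nothing is buffered and the values are read only once, which is the point of sorting the flagged records to the front.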

Upvotes: 2
