Reputation: 42100
Suppose there are N sets of words, and I would like to build a map from those sets that maps each word to the number of its occurrences across all the sets.
For example:
N = 3
S1 = {"a", "b", "c"}, S2 = {"a", "b", "d"}, S3 = {"a", "c", "e"}
M = { "a" -> 3, "b" -> 2, "c" -> 2, "d" -> 1, "e" -> 1 }
Now suppose I have M computers to use. I can make each computer build a map from N/M of the sets, and in a second (final) phase merge the M partial maps into one. This looks like a map/reduce. Does it make sense? How would you improve this approach?
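To make the idea concrete, here is a single-machine sketch of the two phases in Python (collections.Counter is just a convenient stand-in for the per-machine maps; the variable names are mine):

from collections import Counter

sets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "e"}]

# Phase 1: each "computer" counts the words in its share of the sets.
partial_maps = [Counter(s) for s in sets]  # here, one partial map per set

# Phase 2: merge the M partial maps into the final map.
total_map = sum(partial_maps, Counter())
print(total_map)  # Counter({'a': 3, 'b': 2, 'c': 2, 'd': 1, 'e': 1})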
Upvotes: 1
Views: 55
Reputation: 33509
This is the standard map/reduce example: word counting.
For example, here is Python code based on the mincemeat map/reduce library:
#!/usr/bin/env python
import mincemeat

S1 = {"a", "b", "c"}
S2 = {"a", "b", "d"}
S3 = {"a", "c", "e"}

# The datasource maps task keys (here 0, 1, 2) to the input sets.
datasource = dict(enumerate([S1, S2, S3]))

# Map: emit a (word, 1) pair for every word in a set.
def mapfn(k, v):
    for w in v:
        yield w, 1

# Reduce: sum all the 1s emitted for a given word.
def reducefn(k, vs):
    return sum(vs)

s = mincemeat.Server()
s.datasource = datasource
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
print results
Prints
{'a': 3, 'c': 2, 'b': 2, 'e': 1, 'd': 1}
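(To actually run this, start the script above as the server, then launch one or more mincemeat workers pointed at it, e.g. python mincemeat.py -p changeme localhost; check your mincemeat version's README for the exact invocation.)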
Note that the way map/reduce is structured means the server hands out new tasks to clients as they complete their current ones.
This means there is not necessarily a fixed partitioning of N/M tasks per client.
If one client is faster than the others, it will end up being given more tasks, making the best use of the available resources.
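To illustrate the difference (this is just a sketch of dynamic dispatch in general, not mincemeat's internals), Python's multiprocessing.Pool behaves the same way: with imap_unordered, whichever worker finishes first simply picks up the next set:

from multiprocessing import Pool
from collections import Counter

def count_words(s):
    return Counter(s)

if __name__ == "__main__":
    sets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "e"}]
    pool = Pool(processes=2)  # 2 workers for 3 sets: no fixed N/M split
    # imap_unordered gives each worker a new set as soon as it is free.
    partials = pool.imap_unordered(count_words, sets)
    total = sum(partials, Counter())
    pool.close()
    pool.join()
    print(total)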
Upvotes: 2