Reputation: 42100
Suppose there are N sets of words, and I would like to build a map from those sets that maps each word to the number of its occurrences across all the sets.
For example:
N = 3
S1 = {"a", "b", "c"}, S2 = {"a", "b", "d"}, S3 = {"a", "c", "e"}
M = { "a" -> 3, "b" -> 2, "c" -> 2, "d" -> 1, "e" -> 1 }
Now suppose I have M computers to use. I can make each computer build a map from N/M of the sets, and in a second (final) phase merge the M partial maps into one. This looks like a map/reduce. Does it make sense? How would you improve this approach?
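To make the idea concrete, here is a single-machine sketch of the two phases in Python (collections.Counter is just a convenient stand-in for the per-machine maps; the variable names are mine):

from collections import Counter

sets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "e"}]

# Phase 1: each "computer" counts the words in its share of the sets.
partial_maps = [Counter(s) for s in sets]  # here, one partial map per set

# Phase 2: merge the M partial maps into the final map.
total_map = sum(partial_maps, Counter())
print(total_map)  # Counter({'a': 3, 'b': 2, 'c': 2, 'd': 1, 'e': 1})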
Upvotes: 1
Views: 55
Reputation: 33509
This is the standard map/reduce example: word counting.
For example, here is Python code based on the mincemeat map/reduce library:
#!/usr/bin/env python
import mincemeat

S1 = {"a", "b", "c"}
S2 = {"a", "b", "d"}
S3 = {"a", "c", "e"}

# The datasource maps task keys (here 0, 1, 2) to the input sets.
datasource = dict(enumerate([S1, S2, S3]))

# Map: emit a (word, 1) pair for every word in a set.
def mapfn(k, v):
    for w in v:
        yield w, 1

# Reduce: sum all the 1s emitted for a given word.
def reducefn(k, vs):
    return sum(vs)

s = mincemeat.Server()
s.datasource = datasource
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
print results
Prints
{'a': 3, 'c': 2, 'b': 2, 'e': 1, 'd': 1}
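(To actually run this, start the script above as the server, then launch one or more mincemeat workers pointed at it, e.g. python mincemeat.py -p changeme localhost; check your mincemeat version's README for the exact invocation.)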
Note that the way map/reduce is structured means the server hands out new tasks to clients as they complete their current ones.
This means there is not necessarily a fixed partitioning of N/M tasks per client.
If one client is faster than the others, it will end up being given more tasks, making the best use of the available resources.
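To illustrate the difference (this is just a sketch of dynamic dispatch in general, not mincemeat's internals), Python's multiprocessing.Pool behaves the same way: with imap_unordered, whichever worker finishes first simply picks up the next set:

from multiprocessing import Pool
from collections import Counter

def count_words(s):
    return Counter(s)

if __name__ == "__main__":
    sets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "e"}]
    pool = Pool(processes=2)  # 2 workers for 3 sets: no fixed N/M split
    # imap_unordered gives each worker a new set as soon as it is free.
    partials = pool.imap_unordered(count_words, sets)
    total = sum(partials, Counter())
    pool.close()
    pool.join()
    print(total)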
Upvotes: 2