Reputation: 34406
I've been reading about map/reduce so that I can improve my understanding of big data processing techniques, but I'm a little unclear on a couple of things:
Isn't the map function still going to be slow when operating on huge data sets, even with, say, 10 workers/threads/machines/CPUs/whatever? For example, if the data set is a billion records, every worker still needs to iterate over 100 million records, and that transformation still needs to be stored somewhere for processing.
How do indexes on the data factor into a map/reduce scenario (if at all)?
As a bonus question, what I'm trying to do is produce a real-time (<100ms response time) search solution over a data set of roughly 20-50 million records, where the results can be ordered on 1-3 fields and queried on around 20-30 different fields with nested, grouped AND/OR queries. Is map/reduce likely to be the best approach for what I'm doing?
Upvotes: 0
Views: 176
Reputation: 43094
The map function extracts the subset of data (in the final output format) that the reduce function will execute against. Since map is the extraction step, it's reasonable to expect that indexing will be a major factor in the speed of execution. Any time you are looking at billions of records, you will need appropriate optimisations and a suitable platform to keep results timely.
The output from the map function will need to be stored somewhere ready for the reduce step to operate on; that's unavoidable.
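To make that concrete, here's a minimal single-process sketch of the map → group → reduce flow. The record shape and field names (`category`, `amount`) are purely hypothetical; the point is that map emits (key, value) pairs already in the final output format, and that the intermediate output has to be held and grouped somewhere before reduce can run over it:

```python
from collections import defaultdict

# Hypothetical records; the field names ("category", "amount") are
# assumptions for illustration only.
records = [
    {"category": "books", "amount": 12.50},
    {"category": "music", "amount": 7.99},
    {"category": "books", "amount": 3.25},
]

def map_fn(record):
    # The map step extracts just the data reduce needs, already shaped
    # as (key, value) in the final output format.
    yield record["category"], record["amount"]

def reduce_fn(key, values):
    # The reduce step aggregates all intermediate values sharing a key.
    return key, sum(values)

# The intermediate map output has to be stored and grouped by key
# somewhere before reduce can operate on it.
intermediate = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        intermediate[key].append(value)

results = [reduce_fn(key, values) for key, values in intermediate.items()]
print(results)  # [('books', 15.75), ('music', 7.99)]
```

In a distributed setting, that grouping step is the shuffle that moves each key's intermediate values to whichever worker reduces them.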
Map/Reduce gives you an opportunity to segment your search into smaller, more manageable chunks, so it is appropriate to your task. Bear in mind that unless you are using multiple systems, simply adding threads can be counterproductive, as it increases the context switching needed to service them all. I wouldn't assign more threads per system than the number of physical cores, and be prepared for delays while the threads contend for disk or NIC access.
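As a rough illustration of splitting the data into chunks and sizing the worker count to the hardware, here's a sketch using Python's multiprocessing. The `process_chunk` work and the halving of `os.cpu_count()` (which reports logical CPUs, not physical cores) are assumptions for illustration, not a tuned recommendation:

```python
import os
from multiprocessing import Pool

def process_chunk(chunk):
    # Placeholder map work over one chunk of records; the real
    # transformation depends on your query.
    return [record * 2 for record in chunk]

def split_into_chunks(records, n_chunks):
    # Segment the data set into roughly equal, independently processable pieces.
    size = max(1, len(records) // n_chunks)
    return [records[i:i + size] for i in range(0, len(records), size)]

if __name__ == "__main__":
    records = list(range(1_000_000))  # stand-in for real records

    # os.cpu_count() reports logical CPUs; halving it is a crude way to
    # approximate physical cores on a hyperthreaded machine.
    workers = max(1, (os.cpu_count() or 2) // 2)

    chunks = split_into_chunks(records, workers)
    with Pool(processes=workers) as pool:
        partial_results = pool.map(process_chunk, chunks)
    # partial_results holds one list of mapped values per chunk,
    # ready for a reduce/merge step.
```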
You've quite a task ahead of you. I would look at how others have implemented such systems and see whether you could reuse one of those rather than building this yourself. If it's an intellectual exercise, then I hope you'll share the trials, tribulations and outcomes in a blog post somewhere.
Upvotes: 1