Reputation: 554
Can I have threads inside the map function? I have a task where threads could really help me. For every input line, I need to concurrently add values to a hash map. Each input line becomes an array of strings, and every value of this array needs to be added to the hash map. I later use this hash map in the cleanup function.
I am doing this with a for loop, and it seems to be the bottleneck of my project. So I thought of using a ConcurrentHashMap and splitting the array of strings into several smaller arrays, so that each thread is responsible for adding its "smaller" array to the hash map. I have implemented this in a local Java application and it works. But when I use it inside Hadoop, the results are not the expected ones. I call Thread.join() on every thread, so that for every line of input I make sure the threads have finished before the next line is processed. At least that is what I thought I was doing. Does Hadoop treat threads in a special way?
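To make the pattern concrete, here is a minimal sketch of what I am describing (the tokenization and the value type are simplified placeholders, not my actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the described pattern: split the tokens of one input line
// into chunks, let each thread insert its chunk into a shared
// ConcurrentHashMap, and join all threads before the next line.
public class ChunkedInsert {
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();

    public void processLine(String line, int numThreads) throws InterruptedException {
        String[] tokens = line.split("\\s+");
        int chunk = (tokens.length + numThreads - 1) / numThreads;
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            final int from = t * chunk;
            final int to = Math.min(from + chunk, tokens.length);
            Thread worker = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    // merge() is atomic on ConcurrentHashMap, no extra locking needed
                    counts.merge(tokens[i], 1L, Long::sum);
                }
            });
            worker.start();
            threads.add(worker);
        }
        for (Thread worker : threads) {
            worker.join(); // block until every chunk is inserted before the next line
        }
    }
}
```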
Edits for duffymo:
Here is the Google citation: http://research.google.com/pubs/pub36296.html
Algorithm 2 is the part I am talking about. As you can see, there is a for loop over the attributes, and for every attribute I need to update the in-memory structure. They only have to predict one value in their approach (single-label learning), whereas in mine there may be many values to predict (multi-label learning). So what Google calls the y value is a 3-value array for them. For me it might hold up to thousands of values. Aggregating two 3-dimensional vectors is a lot faster than aggregating two 10000-dimensional vectors.
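To illustrate, here is roughly what that per-attribute update looks like in my case (the names are simplified placeholders; numLabels may be in the thousands):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified illustration: for every attribute of a line, an aggregate
// vector of label statistics is updated element-wise. With 3 labels this
// is cheap; with thousands of labels it dominates the map() call.
public class LabelAggregation {
    private final Map<String, double[]> stats = new HashMap<>();
    private final int numLabels;

    public LabelAggregation(int numLabels) {
        this.numLabels = numLabels;
    }

    public void update(String[] attributes, double[] labelVector) {
        for (String attr : attributes) {          // one pass per attribute
            double[] agg = stats.computeIfAbsent(attr, k -> new double[numLabels]);
            for (int i = 0; i < numLabels; i++) { // O(numLabels) work per attribute
                agg[i] += labelVector[i];
            }
        }
    }
}
```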
If I use only a single label in my algorithm, I have no problem at all. The 45 seconds I mentioned are reduced to less than 5. So yes, it works correctly for a single label.
The 45 seconds I mentioned are for the for-loop only; I didn't count the parsing and everything else. The for loop is the bottleneck for sure, since it is the only thing I am timing, and it takes about 45 seconds while the whole task takes about 1 minute (including task initialization and more). I want to try to break that for-loop into 2 or 3 smaller for loops and process them concurrently. Trying means it might work and it might not. Sometimes crazy stuff like what I mentioned can be a necessity; at least that is what a well-respected programmer told me in a previous thread of mine about Hadoop.
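Concretely, I am thinking of something along these lines: a pool created once per task instead of new threads per line, with invokeAll() blocking until every chunk is done. Again, a sketch rather than my actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: split the attribute loop into a few chunks and run them on a
// pool created once (e.g. in setup()), so each map() call pays only for
// the work, not for thread creation. invokeAll() returns after every
// chunk has finished, replacing the manual Thread.join() calls.
public class ParallelLoop {
    private final ExecutorService pool = Executors.newFixedThreadPool(3);

    public void processAttributes(String[] attributes) throws InterruptedException {
        int chunk = (attributes.length + 2) / 3; // split into up to 3 chunks
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int from = 0; from < attributes.length; from += chunk) {
            final int start = from;
            final int end = Math.min(start + chunk, attributes.length);
            tasks.add(() -> {
                for (int i = start; i < end; i++) {
                    // update the concurrent in-memory structure here
                }
                return null;
            });
        }
        pool.invokeAll(tasks); // blocks until all chunks complete
    }

    public void close() {
        pool.shutdown(); // call from cleanup() so the task can exit
    }
}
```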
I didn't provide this many details earlier, since I thought I only wanted an opinion about Hadoop and threads inside the map function. I didn't think anyone would question me so much :P
Upvotes: 3
Views: 3698
Reputation: 8088
Hadoop, by itself, is built for parallelism, but it does it in a very coarse-grained manner. Hadoop's parallelism is good when the dataset is big and can be divided into many subsets that are processed separately and independently (here I am referring to the Map stage only, for simplicity), for example, to search for one pattern in the text.
Now let's consider the following case: we have a lot of data, and we want to search for thousands of different patterns in this text. We have two choices for utilizing our multi-core CPUs:
1. Process each file with a separate single-threaded mapper, and run several mappers per node.
2. Define one mapper per node and process each file with all of its cores.
The second way might be much more cache-friendly, and therefore more efficient.
The bottom line: for cases where fine-grained, multi-core-friendly parallelism is justified by the nature of the processing, multi-threading within the mapper can benefit us.
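For the second option, Hadoop ships a helper class, MultithreadedMapper, which runs the map logic on several threads inside a single mapper task. A minimal driver-side sketch (MyMapper stands in for your own map logic, and any state it shares across threads must be thread-safe):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedDriver {

    // Placeholder for your own map logic. MultithreadedMapper runs it on
    // several threads within one task, so shared state must be thread-safe.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... per-record work goes here ...
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "multithreaded mapper example");
        job.setMapperClass(MultithreadedMapper.class);            // the wrapper that runs the threads
        MultithreadedMapper.setMapperClass(job, MyMapper.class);  // the actual map logic
        MultithreadedMapper.setNumberOfThreads(job, 8);           // threads per mapper task
        // ... set input/output formats, paths, reducer, etc. ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```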
Upvotes: 4
Reputation: 308733
You shouldn't need threads if I understand Hadoop and map/reduce properly.
What makes you think parsing a single line of input is a bottleneck in your project? Does it merely seem like an issue, or do you have data to prove it?
UPDATE: Thank you for the citation. It's obviously something that I and others will have to digest, so I won't have any snappy advice in the short term. But I appreciate the citation and your patience very much.
Upvotes: 3