Reputation: 315
I need some help with MapReduce jobs in Hadoop. I have the following problem: I have a large data set containing multiple documents plus the category of each document. I need to calculate the chi-square value for each term in the documents per category, which means I need the number of occurrences per term per category plus the number of documents per category.
My approach is to have a MapReduce job which counts the number of occurrences of each word for each category:
Mapper: (docId, TextOfDocument) -> ({term, category}, docId)
Reducer output: (term, {category, NumberOfOccurrences})
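Roughly, the mapper would look like this (just a sketch with made-up class names; I'm assuming here that the input arrives via KeyValueTextInputFormat as docId <tab> category <tab> text, which is not exactly my real format):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: emits one ({term, category}, docId) pair per term occurrence.
// Assumes KeyValueTextInputFormat, i.e. map key = docId, map value = "category \t document text".
public class TermCategoryMapper extends Mapper<Text, Text, Text, Text> {
    private final Text outKey = new Text();

    @Override
    protected void map(Text docId, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) return;
        String category = parts[0];
        for (String term : parts[1].split("\\s+")) {
            outKey.set(term + "," + category);  // composite key {term, category}
            context.write(outKey, docId);       // the reducer counts these values per key
        }
    }
}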
The problem with this is that I lose the information about the number of documents per category, which I need in my next job to calculate the chi-square value.
I thought about the following solutions:
1) Use counters per category to store the number of documents per category while reading in the documents. I think this would be the best and easiest solution. The problem is that I don't know the number of categories in advance, so I would need to create counters dynamically. I didn't find a way to do this in Hadoop (create counters dynamically). Is there a way, and how would I do it?
2) First, run a job that counts the number of documents per category and stores the result somewhere. I don't know how to retrieve that data, or how to store it in a way that is convenient to read in while processing the whole set of documents.
3) Partition it somehow, with extra values in the data types, and count it that way.
Could anyone help me with this problem? Which approach would be the best? Or are there other approaches? Thanks for your help!
Upvotes: 1
Views: 581
Reputation: 1031
I think I finally found a solution to calculate your term counts per category and the number of documents per category in one pass.
In your map phase you should extract whatever you need; your input and outputs should look something like this:
<docId, TextOfDocument> -->
1. "<C_AFFIX+category+C_AFFIX, 1>"
2. "<CT_AFFIX+category+term+CT_AFFIX, 1>"
C_AFFIX and CT_AFFIX are just identifiers that keep the keys of these two different record types from getting mixed up with each other.
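For illustration, the mapper could look roughly like this (just a sketch: the concrete affix values, the "#" separator between category and term, the whitespace tokenization, and the assumed input format "category <tab> text" are my choices, not requirements):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: one mapper emitting both record types, tagged with the affixes.
public class CategoryAndTermMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final String C_AFFIX = "C_";   // assumed value; any unique tag works
    private static final String CT_AFFIX = "T_";  // assumed value
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input line format: "category \t text of the document"
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) return;
        String category = parts[0];

        // Record type 1: one count per document for the category.
        outKey.set(C_AFFIX + category + C_AFFIX);
        context.write(outKey, ONE);

        // Record type 2: one count per term occurrence in this category.
        for (String term : parts[1].toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            outKey.set(CT_AFFIX + category + "#" + term + CT_AFFIX);
            context.write(outKey, ONE);
        }
    }
}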
In your reduce phase you act just like in the classical word-count problem: just sum the counts, and the output comes out sorted by key:
int sum = 0;
for (IntWritable val : values) {
    sum += val.get();
}
result.set(sum);
context.write(key, result);
The C_AFFIX and CT_AFFIX tags make sure that the output records of each type sit next to each other.
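For example, with the assumed affix values C_AFFIX = "C_" and CT_AFFIX = "T_" from the sketch above (category and term names here are made up purely for illustration), the sorted job output is laid out roughly like this:

C_sportsC_          <number of documents in category "sports">
C_techC_            <number of documents in category "tech">
T_sports#ballT_     <occurrences of "ball" in category "sports">
T_sports#goalT_     <occurrences of "goal" in category "sports">
T_tech#cpuT_        <occurrences of "cpu" in category "tech">

Because "C_" sorts before "T_", all document counts come out first, and within the "T_" block the term counts of the same category are adjacent, which makes it easy for the next job to join them for the chi-square calculation.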
Upvotes: 2