Reputation: 21
I have a Java program that reads all the words of a PDF file. I saved the words together with their page numbers in a database (CouchDB). Each word with its page number is a separate document in CouchDB. Now I want to write a map and a reduce function that list each word with the page numbers where it occurs, but if a word occurs more than once on a page, I want just one entry for that page. The result should have one entry for the word and a second entry with a list (a comma-separated string) of page numbers. How can I do this with a map-reduce function (filtering out duplicate page numbers)? Thanks for your help.
Upvotes: 2
Views: 1620
Reputation: 1792
Surely there is more than one way of doing it. I'd go for something simple. Let's say your documents look somewhat like this:
{ "type": "word-index", "word": "Great", "page_number": 45 }
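This is the result of finding the word 'Great' on page 45. Now your view index is created by a map function:
function (doc) {
  // Only index the word documents; skip everything else in the database.
  if (doc.type === 'word-index') {
    // Composite key [word, page]; the value is unused because _count only counts rows.
    emit([doc.word, doc.page_number], null);
  }
}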
For the reduce part, just use the built-in "_count" function.
Now, to get the list of all the occurrences of the word "Great" in your book, just query your view with startkey=["Great"] and endkey=["Great", {}]. The result would look somewhat like this:
["Great", 45], 4
["Great", 70], 7
This means that the word "Great" appeared 4 times on page 45 and 7 times on page 70. Each page shows up only once per word, so you can build the comma-separated list you need from the keys. The number of occurrences is a bonus.
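Building the comma-separated list from those rows could look like the sketch below. It assumes the rows have already been fetched and have the usual view row shape, { key: [word, page], value: count }; the function name pagesByWord and the sample data are made up for illustration.

```javascript
// Collapse grouped view rows into one comma-separated page list per word.
// Assumes each row looks like { key: [word, pageNumber], value: count }.
function pagesByWord(rows) {
  const pages = new Map();
  for (const row of rows) {
    const [word, page] = row.key;
    if (!pages.has(word)) {
      pages.set(word, []);
    }
    // Grouped rows are unique per [word, page], so no extra de-duplication is needed.
    pages.get(word).push(page);
  }
  const result = {};
  for (const [word, list] of pages) {
    result[word] = list.join(",");
  }
  return result;
}

// Sample rows mirroring the result shown above.
const sampleRows = [
  { key: ["Great", 45], value: 4 },
  { key: ["Great", 70], value: 7 },
];
console.log(pagesByWord(sampleRows)); // { Great: '45,70' }
```

Since the view returns rows sorted by key, the page numbers come out in ascending order without any extra sorting.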
--edit--
You also have to use group_level=2 in your query. If you don't, the result of the query will simply be a single row with the count of all the documents in the queried key range.
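Putting the three query parameters together, the request URL could be assembled roughly like this. The host, the database name "book", the design document "words" and the view name "pages" are placeholders, not something from your setup; only startkey, endkey and group_level come from the answer above. The keys must be JSON-encoded and then URL-encoded.

```javascript
// Build the query URL for all pages of one word.
// viewUrl is assumed to point at the view, e.g.
// http://localhost:5984/book/_design/words/_view/pages (hypothetical names).
function wordPagesUrl(viewUrl, word) {
  const query = [
    // Keys are JSON values, so stringify them before URL-encoding.
    "startkey=" + encodeURIComponent(JSON.stringify([word])),
    // {} sorts after any number, so this covers every page of the word.
    "endkey=" + encodeURIComponent(JSON.stringify([word, {}])),
    // Group on both key elements: one row per [word, page].
    "group_level=2",
  ].join("&");
  return viewUrl + "?" + query;
}

const exampleUrl = wordPagesUrl(
  "http://localhost:5984/book/_design/words/_view/pages",
  "Great"
);
console.log(exampleUrl);
```

A GET request to that URL should return the grouped rows for "Great" only.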
Upvotes: 4