Reputation: 123
I have an index containing the documentation for all of our products. The document fields are:
Because most of our documentation spans several pages, I create one document in the index for each page. So when I search for a product by group, name and version, I get several results. But sometimes I want only one result for this combination (group, name and version), regardless of how many documents exist for the product.
Therefore I used the DuplicateFilter.
Because this filter can only be used on a single field (and not on field combinations), I created another field (productkey). In this field I store an id for the product (the MD5 hash of the combination of the group, name and version fields). Then I told the DuplicateFilter to use this field to filter duplicates.
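For illustration, a key like this could be computed roughly as follows. This is only a sketch: the class name `ProductKey`, the `|` separator and the UTF-8 encoding are assumptions made for the example, so the hashes it prints are not expected to match the values in the table below.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ProductKey {
    // Hypothetical helper: joins the three fields and returns the MD5 hash as hex.
    // The "|" separator is an assumption; any unambiguous separator would do.
    static String productKey(String group, String name, String version) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            String combined = group + "|" + name + "|" + version;
            byte[] digest = md5.digest(combined.getBytes(StandardCharsets.UTF_8));
            // Format as 32 lowercase hex characters, keeping leading zeros.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a standard algorithm name, so this should not normally happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // All pages of the same product get the same key.
        System.out.println(productKey("a", "one", "1.0"));
    }
}
```

The value returned here would be stored in the productkey field of every document that belongs to the same product.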
But now I do not get all the expected search results. For example:
Documents:
group | name | version | productkey | description
a | one | 1.0 | 808d8f96138b7dec7cc69c2769176424 | ...
a | two | 1.0 | 0225635fc76ed8b88c65c7eb9f2ec1f9 | ...
a | two | 1.0 | 0225635fc76ed8b88c65c7eb9f2ec1f9 | ...
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a | ...
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a | ...
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a | ...
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a | ...
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a | ...
zz | one | 1.0 | b610a470c9a7d2cc928725e1fb1a577a | ...
zz | one | 1.0 | b610a470c9a7d2cc928725e1fb1a577a | ...
zz | one | 1.0 | b610a470c9a7d2cc928725e1fb1a577a | ...
zz | two | 1.0 | f5bb84453af30dd5f229d04cdb787dec | ...
zz | three| 1.0 | 4b86d91feded953e57fb3d1ccbf0fc6e | ...
zz | three| 1.0 | 4b86d91feded953e57fb3d1ccbf0fc6e | ...
zz | three| 1.0 | 4b86d91feded953e57fb3d1ccbf0fc6e | ...
Results:
group | name | version | productkey
a | two | 1.0 | 0225635fc76ed8b88c65c7eb9f2ec1f9
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f
zz | two | 1.0 | f5bb84453af30dd5f229d04cdb787dec
So I am missing these products:
group | name | version | productkey
a | one | 1.0 | 808d8f96138b7dec7cc69c2769176424
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a
zz | one | 1.0 | b610a470c9a7d2cc928725e1fb1a577a
zz | three| 1.0 | 4b86d91feded953e57fb3d1ccbf0fc6e
Here is my code to instantiate the filter:
DuplicateFilter filter = new DuplicateFilter("productkey");
filter.setKeepMode(DuplicateFilter.KM_USE_FIRST_OCCURRENCE);
filter.setProcessingMode(DuplicateFilter.PM_FULL_VALIDATION);
Did I make a mistake, or is this a bug in the DuplicateFilter (maybe the field values are too long, etc.)?
I am using Lucene 3.6.
Upvotes: 0
Views: 1521
Reputation: 5805
Yes, this won't work this way. The filter "cleans up" all index documents before the search query gets the matching documents.
For example, suppose your index contains the following documents:
docId, value
1, a
1, b
1, c
2, c
4, a
5, d
and you have a special filter that filters duplicates by id. Then only the following documents are left for the search:
docId, value
1, a
4, a
2, c
5, d
Only after this is your search run. For example, if you search for all "c"s, you will get only
2, c
even though there are 2 "c"s with different ids in the index.
So your combination won't work this way.
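The effect of the ordering can be reproduced with the example data above in plain Java. This is a simulation of the behaviour described here, not Lucene code; the class `FilterOrderDemo` and its methods are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FilterOrderDemo {
    // One index entry: the id used for duplicate detection and a searchable value.
    static final class Doc {
        final int id;
        final String value;
        Doc(int id, String value) { this.id = id; this.value = value; }
        @Override public String toString() { return "(" + id + ", " + value + ")"; }
    }

    // The example index from the answer, in index order.
    static List<Doc> index() {
        List<Doc> docs = new ArrayList<Doc>();
        docs.add(new Doc(1, "a"));
        docs.add(new Doc(1, "b"));
        docs.add(new Doc(1, "c"));
        docs.add(new Doc(2, "c"));
        docs.add(new Doc(4, "a"));
        docs.add(new Doc(5, "d"));
        return docs;
    }

    // Keeps only the first document seen for each id.
    static List<Doc> keepFirstPerId(List<Doc> docs) {
        Set<Integer> seen = new HashSet<Integer>();
        List<Doc> kept = new ArrayList<Doc>();
        for (Doc d : docs) {
            if (seen.add(d.id)) {
                kept.add(d);
            }
        }
        return kept;
    }

    // Stand-in for the search query: returns documents whose value matches.
    static List<Doc> search(List<Doc> docs, String value) {
        List<Doc> hits = new ArrayList<Doc>();
        for (Doc d : docs) {
            if (d.value.equals(value)) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Duplicates removed first, then searched: document (1, c) is already gone.
        System.out.println(search(keepFirstPerId(index()), "c"));  // [(2, c)]
        // Searched first, then duplicates removed: both ids survive.
        System.out.println(keepFirstPerId(search(index(), "c")));  // [(1, c), (2, c)]
    }
}
```

If one result per product is what you need, one possible approach along these lines is to run the query without the filter and collapse the hits by productkey in your own code afterwards.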
Upvotes: 1