jet

Reputation: 123

Lucene DuplicateFilter filters not only duplicate results

I have an index containing all the documentation for our products. The document fields are:

Because most of our documentation spans several pages, I create one document in the index for each page. So when I search for a product by group, name and version, I get several results. But sometimes I want only one result for this combination (group, name and version), regardless of how many documents exist for the product.

Therefore I used the DuplicateFilter:

Because this filter can only be used on a single field (not on a combination of fields), I created another field (productkey). In this field I store an ID for the product (the MD5 hash of the combination of the group, name and version fields). Then I told the DuplicateFilter to use this field to filter duplicates.
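A minimal sketch of how such a key could be derived, using only the JDK. The class name `ProductKeys`, the helper `productKey`, and the `|` separator are assumptions made for illustration; the question does not say how the three fields were concatenated, so the keys produced here are not expected to match the hashes in the table below.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ProductKeys {
    private ProductKeys() {}

    // Hypothetical helper: MD5 hex digest of group, name and version.
    // The '|' separator is an assumption; it keeps ("ab", "c") and ("a", "bc") apart.
    static String productKey(String group, String name, String version) {
        String combined = group + "|" + name + "|" + version;
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(combined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two lowercase hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to ship MD5, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Prints a 32-character hex string for one product.
        System.out.println(productKey("a", "one", "1.0"));
    }
}
```

Every page of the same product gets the same key, which is what lets a single-field filter treat those pages as duplicates.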

But now I don't get all of the expected search results. For example:

Documents:

group | name | version | productkey | description
a     | one  | 1.0     | 808d8f96138b7dec7cc69c2769176424 | ...
a     | two  | 1.0     | 0225635fc76ed8b88c65c7eb9f2ec1f9 | ...
a     | two  | 1.0     | 0225635fc76ed8b88c65c7eb9f2ec1f9 | ...
a     | three| 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a     | three| 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a     | three| 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a     | three| 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a     | three| 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a     | four | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | four | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | four | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | four | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | four | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | four | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | five | 1.0     | b2d49bc320325007e1466a38e41ce69a | ...
a     | five | 1.0     | b2d49bc320325007e1466a38e41ce69a | ...
a     | five | 1.0     | b2d49bc320325007e1466a38e41ce69a | ...
a     | five | 1.0     | b2d49bc320325007e1466a38e41ce69a | ...
a     | five | 1.0     | b2d49bc320325007e1466a38e41ce69a | ...
zz    | one  | 1.0     | b610a470c9a7d2cc928725e1fb1a577a | ...
zz    | one  | 1.0     | b610a470c9a7d2cc928725e1fb1a577a | ...
zz    | one  | 1.0     | b610a470c9a7d2cc928725e1fb1a577a | ...
zz    | two  | 1.0     | f5bb84453af30dd5f229d04cdb787dec | ...
zz    | three| 1.0     | 4b86d91feded953e57fb3d1ccbf0fc6e | ...
zz    | three| 1.0     | 4b86d91feded953e57fb3d1ccbf0fc6e | ...
zz    | three| 1.0     | 4b86d91feded953e57fb3d1ccbf0fc6e | ...

Results:

group | name | version | productkey
a     | two  | 1.0     | 0225635fc76ed8b88c65c7eb9f2ec1f9
a     | three| 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f
zz    | two  | 1.0     | f5bb84453af30dd5f229d04cdb787dec

So I am missing these products:

group | name | version | productkey
a     | one  | 1.0     | 808d8f96138b7dec7cc69c2769176424
a     | four | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8
a     | five | 1.0     | b2d49bc320325007e1466a38e41ce69a
zz    | one  | 1.0     | b610a470c9a7d2cc928725e1fb1a577a
zz    | three| 1.0     | 4b86d91feded953e57fb3d1ccbf0fc6e

Here is my code to instantiate the filter:

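import org.apache.lucene.search.DuplicateFilter; // contrib "queries" module in Lucene 3.x

// Keep only one document per distinct productkey value.
DuplicateFilter filter = new DuplicateFilter("productkey");
filter.setKeepMode(DuplicateFilter.KM_USE_FIRST_OCCURRENCE);
filter.setProcessingMode(DuplicateFilter.PM_FULL_VALIDATION);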

Did I make a mistake, or is this a bug in the DuplicateFilter (maybe the field values are too long, etc.)?

I am using Lucene 3.6.

Upvotes: 0

Views: 1521

Answers (1)

chresse

Reputation: 5805

Yes, this won't work this way. That's because the filter "cleans up" all documents in the index before the search query gets the matching documents.

For example, suppose your index contains the following documents:

docId, value
1, a
1, b
1, c
2, c
4, a
5, d

If you then apply a filter that removes duplicates by docId, only the following documents remain for the search:

docId, value
1, a
4, a
2, c
5, d

Only after this is your search run. For example, if you search for all "c"s, you will get only

2, c

even though there are two "c"s with different IDs in the index.
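The ordering described above can be reproduced with a small plain-Java simulation. This is only a sketch of the behaviour as explained in this answer, not Lucene code: the class `FilterOrderDemo` and its methods are hypothetical stand-ins, with `dedupeByDocId` playing the role of the duplicate filter and `matching` playing the role of the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FilterOrderDemo {
    // One index entry from the example: a duplicate key (docId) and a searchable value.
    static final class Doc {
        final int docId;
        final String value;
        Doc(int docId, String value) { this.docId = docId; this.value = value; }
        @Override public String toString() { return "(" + docId + ", " + value + ")"; }
    }

    // Stand-in for the duplicate filter: keeps the first entry seen for each docId.
    static List<Doc> dedupeByDocId(List<Doc> docs) {
        Set<Integer> seen = new HashSet<>();
        List<Doc> kept = new ArrayList<>();
        for (Doc d : docs) {
            if (seen.add(d.docId)) kept.add(d); // add() returns false for a repeated docId
        }
        return kept;
    }

    // Stand-in for the search query: keeps entries whose value equals the term.
    static List<Doc> matching(List<Doc> docs, String term) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (d.value.equals(term)) hits.add(d);
        }
        return hits;
    }

    // The six entries from the example above, in index order.
    static List<Doc> exampleIndex() {
        return Arrays.asList(new Doc(1, "a"), new Doc(1, "b"), new Doc(1, "c"),
                             new Doc(2, "c"), new Doc(4, "a"), new Doc(5, "d"));
    }

    public static void main(String[] args) {
        List<Doc> index = exampleIndex();
        // Order described in this answer: dedupe the whole index, then search.
        System.out.println(matching(dedupeByDocId(index), "c")); // [(2, c)]
        // Reverse order, shown only for contrast: search, then dedupe the hits.
        System.out.println(dedupeByDocId(matching(index, "c"))); // [(1, c), (2, c)]
    }
}
```

In the first ordering, the entry (1, c) is discarded because (1, a) was the first entry for docId 1, so it is gone before the search ever looks for "c".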

So your combination won't work this way.

Upvotes: 1
