Reputation: 63
We have about 9k documents indexed using Haystack 1.2.7 with Whoosh 2.4.1 as the backend. Although we are using Haystack, this looks like a Whoosh problem. Here are my debug cases:
1) If I just run an exact lookup, Whoosh finds my document (as below):
>>> SearchQuerySet().all().models(PedidoSaida).filter(numero__exact='6210202443/10')
[<SearchResult: logistica.pedidosaida (pk=u'6')>]
2) If I just run a startswith lookup, Whoosh doesn't find my document (as below):
>>> SearchQuerySet().all().models(PedidoSaida).filter(numero__startswith='6210202443/10')
[]
3) If I put all together in a single OR query, Whoosh still doesn't find my document (as below):
>>> SearchQuerySet().all().models(PedidoSaida).filter(SQ(numero__exact='6210202443/10') | SQ(numero__startswith='6210202443/10'))
[]
Taking a look into the queries that Haystack sends to Whoosh, we have:
>>> str(SearchQuerySet().all().models(PedidoSaida).filter(numero__exact='6210202443/10').query)
'(numero:6210202443/10) AND (django_ct:logistica.pedidosaida)'
>>> str(SearchQuerySet().all().models(PedidoSaida).filter(numero__startswith='6210202443/10').query)
'(numero:6210202443/10*) AND (django_ct:logistica.pedidosaida)'
>>> str(SearchQuerySet().all().models(PedidoSaida).filter(SQ(numero__exact='6210202443/10') | SQ(numero__startswith='6210202443/10')).query)
'((numero:6210202443/10 OR numero:6210202443/10*)) AND (django_ct:logistica.pedidosaida)'
As you can see, the last query is exactly (first OR second). Shouldn't Whoosh find my document? I can't see where my logic is wrong: I'm using OR, yet it finds fewer results than one of the statements on its own.
I also find it odd that Whoosh finds my document with the first query (numero:6210202443/10) but not with the second one (numero:6210202443/10*). I guess that has to do with the StemmingAnalyzer that Haystack uses for my CharField. I'll take a closer look at that later.
Upvotes: 2
Views: 419
Reputation: 493
Following @Eevee's ideas, I ran some tests. Check this one:
>>> QueryParser("content", schema=None).parse('((numero:6210202443/10 OR (numero:6210202443/10*))) AND (django_ct:logistica.pedidosaida)')
And([
    Or([
        Term('numero', '6210202443/10'),
        And([
            Term('numero', '6210202443/'),
            Prefix('content', '10')
        ])
    ]),
    Term('django_ct', 'logistica.pedidosaida')
])
It seems that / has precedence over OR. Does that make sense? I think logical operators should have the highest precedence. Do you agree?
If this behaviour is correct, then I guess it is a bug in Haystack's query generator, isn't it?
I want to contribute a patch, but I'm not sure whether it is really a bug in the parser. It depends on which precedence makes more sense.
Upvotes: 0
Reputation: 48536
You can use a QueryParser directly to see how Whoosh is parsing that query:
>>> from whoosh.qparser import QueryParser
>>> QueryParser("content", schema=None).parse('((numero:6210202443/10 OR numero:6210202443/10*)) AND (django_ct:logistica.pedidosaida)')
And([Or([Term('numero', '6210202443/10'), Term('numero', '6210202443/')]), Prefix('content', '10'), Term('django_ct', 'logistica.pedidosaida')])
Let's reformat that last line:
And([
    Or([
        Term('numero', '6210202443/10'),
        Term('numero', '6210202443/'),
    ]),
    Prefix('content', '10'),
    Term('django_ct', 'logistica.pedidosaida'),
])
So it looks like * is binding more tightly than the / in your search term. I could see arguing this as a bug in Whoosh, sure. (I'm sure the maintainer would love your patch ☺)
Workarounds coming to mind:
Build the query yourself instead of round-tripping through Whoosh's fuzzily-defined and human-oriented query language. Of course, that only works if your index is on the same machine and you're reading it with the same process; I don't know much about Haystack.
Avoid using slashes in the numero field. Change them to something less likely to look like query syntax, like underscores.
Avoid including the slash when you do a prefix search; for example, 6210202443* works fine anywhere in a query.
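The last two workarounds can be sketched in plain Python. This is only an illustration: the helper names sanitize_numero and prefix_before_slash are hypothetical and are not part of Haystack or Whoosh. The sanitized value would have to be used both when indexing the field and when building the search term, otherwise the two will no longer match.

```python
def sanitize_numero(numero, replacement="_"):
    """Replace slashes so the value no longer looks like query syntax.

    Hypothetical helper: apply it both at index time and at query time.
    """
    return numero.replace("/", replacement)


def prefix_before_slash(numero):
    """Return the part before the first slash, for use in a prefix search.

    Hypothetical helper: the result is broader than the original value,
    so it may match more documents than the full number would.
    """
    return numero.split("/", 1)[0]


numero = "6210202443/10"
print(sanitize_numero(numero))      # 6210202443_10
print(prefix_before_slash(numero))  # 6210202443
```

With Haystack, the second helper's result could then be passed to a lookup such as numero__startswith, keeping the slash out of the generated query string entirely.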
Upvotes: 1