Reputation: 774
I am having some xml documents which have a structure like this:
<root>
<intro>...</intro>
...
<body>
<p>..................
some text CO<sub>2</sub>
.................. </p>
</body>
</root>
Now I want to search all the results with phrase CO2 and also want to get results of above type in search results. For this purpose, I am using this query -
cts:search
(fn:collection ("urn:iddn:collections:searchable"),
cts:element-query
(
fn:QName("http://iddn.icis.com/ns/fields","body"),
cts:word-query
(
"CO2",
("case-insensitive","diacritic-sensitive","punctuation-insensitive",
"whitespace-sensitive","unstemmed","unwildcarded","lang=en"),
1
)
)
,
("unfiltered", "score-logtfidf"),
0.0)
But using this I am not able to get document with CO<sub>2</sub>
. I am only getting data with simple phrase CO2
.
If I replace the search phrase to CO 2
then I am able to get documents only with CO<sub>2</sub>
and not with CO2
I want to get combined data for both CO<sub>2</sub>
and CO2
as search results.
So can I ignore <sub>
by any means, or is there any other way to cater this problem?
Upvotes: 2
Views: 238
Reputation: 4912
The issue here is tokenization. "CO2" is a single word token. CO<sub>2</sub>, even with phrase-through, is a phrase of two word tokens: "CO" and "2". Just as "blackbird" does not match "black bird", so too does "CO2" not match "CO 2". The phrase-through setting just means that we're willing to look for a phrase that crosses the <sub> element boundary.
You can't splice together CO<sub>2</sub> into one token, but you might be able to use customized tokenization overrides to break "CO2" into two tokens. Define a field and define overrides for the digits as 'symbol'. This will make each digit its own token and will break "CO2" into two tokens in the context of that field. You'd then need to replace the word-query with a field-word-query.
You probably don't want this to apply anywhere in a document, so you'd be best of adding markup around these kinds of chemical phrases in your documents. Fields in general and tokenization overrides in particular will come at a performance cost. The contents of a field are indexed completely separately so the index is bigger, and the tokenization overrides mean that we have to retokenize as well, both on ingest and at query time. This will slow things down a little (not a lot).
Upvotes: 5
Reputation: 7770
It appears that you want to add a phrase-through configuration.
Example:
<p>to <b>be</b> or not to be</p>
A phrase-through on <b>
would then be indexed as "to be or not to be"
Upvotes: 2