Reputation: 29
If I search for "agua", I get no results. How can I make the search ignore accents (á, ã, ç)?
#city database (id, name, uf, province_id)
1 Águas Clara PR 3
2 águas PR 4
3 Áraguaia PR 3
#schema.xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Upvotes: 1
Views: 1398
Reputation: 558
Use ASCIIFoldingFilterFactory in your filter chain, for both index and query. Using your example:
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
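To see why folding on both sides makes "agua" match, here is a rough Python sketch of the two chains. It is only an approximation: whitespace splitting stands in for StandardTokenizer, and Unicode decomposition stands in for ASCIIFoldingFilter (the real filter uses its own mapping table). The function names are made up for illustration.

```python
import unicodedata

def fold_ascii(text):
    """Approximate accent folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def edge_ngrams(token, min_size=1, max_size=15):
    """Prefixes of the token, mirroring minGramSize/maxGramSize in the schema."""
    return {token[:n] for n in range(min_size, min(len(token), max_size) + 1)}

def index_terms(value):
    """Index-side chain: split on whitespace, fold, lowercase, edge n-grams."""
    terms = set()
    for token in value.split():
        terms |= edge_ngrams(fold_ascii(token).lower())
    return terms

def query_terms(value):
    """Query-side chain: split on whitespace, fold, lowercase (no n-grams)."""
    return [fold_ascii(token).lower() for token in value.split()]

def matches(query, value):
    """A value matches when every query term is among its indexed terms."""
    indexed = index_terms(value)
    return all(term in indexed for term in query_terms(query))

cities = ["Águas Clara", "águas", "Áraguaia"]
hits = [city for city in cities if matches("agua", city)]
print(hits)  # prints ['Águas Clara', 'águas']
```

Without the folding step, the indexed prefixes would start with "á", so the plain query term "agua" would never be found among them.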
Upvotes: 6
Reputation: 1165
Run a field analysis to see how the text is processed at query time versus index time.
Try something like this, adjusting the host, core, and field names:
http://localhost:8983/solr/core1/analysis/field?wt=json&analysis.showmatch=true&analysis.fieldvalue=%C3%A1guas&analysis.query=%C3%A1guas&analysis.fieldname=name
The response shows how your terms are handled at each step of the analysis. The sample below comes from a different core, with a field named Noms and the value "état":
{
  responseHeader:{
    status:0,
    QTime:2
  },
  analysis:{
    field_types:{
    },
    field_names:{
      Noms:{
        index:[
          "org.apache.lucene.analysis.standard.StandardTokenizer",
          [
            {
              text:"état",
              raw_bytes:"[c3 a9 74 61 74]",
              start:0,
              end:4,
              type:"<ALPHANUM>",
              position:1,
              positionHistory:[
                1
              ]
            }
          ],
          "org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter",
          [
            {
              text:"etat",
              raw_bytes:"[65 74 61 74]",
              match:true,
              position:1,
              positionHistory:[
                1,
                1
              ],
              start:0,
              end:4,
              type:"<ALPHANUM>"
            }
          ],
          "org.apache.lucene.analysis.core.StopFilter",
          [
            {
              text:"etat",
              raw_bytes:"[65 74 61 74]",
              match:true,
              position:1,
              positionHistory:[
                1,
                1,
                1
              ],
              start:0,
              end:4,
              type:"<ALPHANUM>"
            }
          ],
          "org.apache.lucene.analysis.core.LowerCaseFilter",
          [
            {
              text:"etat",
              raw_bytes:"[65 74 61 74]",
              match:true,
              position:1,
              positionHistory:[
                1,
                1,
                1,
                1
              ],
              start:0,
              end:4,
              type:"<ALPHANUM>"
            }
          ]
        ],
        query:[
          "org.apache.lucene.analysis.standard.StandardTokenizer",
          [
            {
              text:"état",
              raw_bytes:"[c3 a9 74 61 74]",
              start:0,
              end:4,
              type:"<ALPHANUM>",
              position:1,
              positionHistory:[
                1
              ]
            }
          ],
          "org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter",
          [
            {
              text:"etat",
              raw_bytes:"[65 74 61 74]",
              position:1,
              positionHistory:[
                1,
                1
              ],
              start:0,
              end:4,
              type:"<ALPHANUM>"
            }
          ],
          "org.apache.lucene.analysis.synonym.SynonymFilter",
          [
            {
              text:"etat",
              raw_bytes:"[65 74 61 74]",
              org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength:1,
              type:"<ALPHANUM>",
              start:0,
              end:4,
              position:1,
              positionHistory:[
                1,
                1,
                1
              ]
            }
          ],
          "org.apache.lucene.analysis.core.StopFilter",
          [
            {
              text:"etat",
              raw_bytes:"[65 74 61 74]",
              position:1,
              positionHistory:[
                1,
                1,
                1,
                1
              ],
              start:0,
              end:4,
              type:"<ALPHANUM>",
              org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength:1
            }
          ],
          "org.apache.lucene.analysis.core.LowerCaseFilter",
          [
            {
              text:"etat",
              raw_bytes:"[65 74 61 74]",
              position:1,
              positionHistory:[
                1,
                1,
                1,
                1,
                1
              ],
              start:0,
              end:4,
              type:"<ALPHANUM>",
              org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength:1
            }
          ]
        ]
      }
    }
  }
}
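If you want to try several values, the analysis URL from this answer can be assembled in a few lines of Python so the accented characters get percent-encoded for you. This is only a sketch: the helper name analysis_url is made up, and the host, core and field are the same placeholders used in the URL above.

```python
from urllib.parse import urlencode

def analysis_url(base, core, field, value, query):
    """Build a field-analysis URL; urlencode percent-encodes non-ASCII as UTF-8."""
    params = {
        "wt": "json",
        "analysis.showmatch": "true",
        "analysis.fieldvalue": value,   # text to run through the index analyzer
        "analysis.query": query,        # text to run through the query analyzer
        "analysis.fieldname": field,
    }
    return f"{base}/{core}/analysis/field?{urlencode(params)}"

# Placeholder host, core and field name; adjust to your setup.
url = analysis_url("http://localhost:8983/solr", "core1", "name", "águas", "águas")
print(url)
```

Open the printed URL in a browser or fetch it with any HTTP client, and compare the final tokens of the index and query sections.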
Upvotes: 0