Augusto Spinelli

Reputation: 21

How to create a custom analyzer to ignore accents and pt-BR stopwords using the Elasticsearch NEST API?

First of all, note that I am using a "News" class (Noticia in Portuguese) that has a string field called "Content" (Conteudo in Portuguese):

public class Noticia
{
    public string Conteudo { get; set; } 
}

I am trying to create an index that is configured to ignore accents and pt-BR stopwords, and to allow up to 40 million characters to be analyzed in a highlighted query.

I can create such an index using this code:

var createIndexResponse = client.Indices.Create(indexName, c => c
    .Settings(s => s
        .Setting("highlight.max_analyzed_offset" , 40000000)
        .Analysis(analysis => analysis
            .TokenFilters(tokenfilters => tokenfilters
                .AsciiFolding("folding-accent", ft => ft
                )
                .Stop("stoping-br", st => st
                    .StopWords("_brazilian_")
                )
            )
            .Analyzers(analyzers => analyzers
                .Custom("folding-analyzer", cc => cc
                    .Tokenizer("standard")
                    .Filters("folding-accent", "stoping-br")
                )
            )
        )
    )
    .Map<Noticia>(mm => mm
        .AutoMap()
        .Properties(p => p
            .Text(t => t
                .Name(n => n.Conteudo)
                .Analyzer("folding-analyzer")
            )
        )
    )
);

If I test this analyzer using Kibana Dev Tools, I get the result I want: accents are removed and stopwords are dropped.

POST intranet/_analyze
{
  "analyzer": "folding-analyzer",
  "text": "Férias de todos os funcionários"
}

Result:

{
  "tokens" : [
    {
      "token" : "Ferias",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "funcionarios",
      "start_offset" : 19,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

I get the same (good) results when I use NEST to analyze the text with my folding analyzer (the tokens "Ferias" and "funcionarios" are returned):

var analyzeResponse = client.Indices.Analyze(a => a
    .Index(indexName)
    .Analyzer("folding-analyzer")
    .Text("Férias de todos os funcionários")
);

However, if I perform a search using the NEST Elasticsearch .NET client, terms like "Férias" (with accent) and "Ferias" (without accent) are being treated as different.

My goal is to perform a query that returns all results, regardless of whether the word is Férias or Ferias.

This is the simplified C# (NEST) code I am using to query Elasticsearch:

var searchResponse = ElasticClient.Search<Noticia>(s => s
    .Index(indexName)
    .Query(q => q
        .MultiMatch(m => m
            .Fields(f => f
                .Field(p => p.Titulo, 4)
                .Field(p => p.Conteudo, 2)
            )
            .Query(termo)
        )
    )
);

And this is the debug information for the API call associated with searchResponse:

Successful (200) low level call on POST: /intranet/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
 - [1] HealthyResponse: Node: ###NODE ADDRESS### Took: 00:00:00.3880295
# Request:
{"query":{"multi_match":{"fields":["categoria^1","titulo^4","ementa^3","conteudo^2","attachments.attachment.content^1"],"query":"Ferias"}},"size":100}
# Response:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 13.788051,
    "hits" : [
      {
        "_index" : "intranet",
        "_type" : "_doc",
        "_id" : "4934",
        "_score" : 13.788051,
        "_source" : {
          "conteudo" : "blablabla ferias blablabla",
          "attachments" : [ ],
          "categoria" : "Novidades da Biblioteca - DBD",
          "publicadaEm" : "2008-10-14T00:00:00",
          "titulo" : "INFORMATIVO DE DIREITO ADMINISTRATIVO E LRF - JUL/2008",
          "ementa" : "blablabla",
          "matriculaAutor" : 900794,
          "atualizadaEm" : "2009-02-03T13:44:00",
          "id" : 4934,
          "indexacaoAtiva" : true,
          "status" : "Disponível"
        }
      }
    ]
  }
}

I have also tried using multi-fields and Suffix in the query, without success:

.Map<Noticia>(mm => mm
    .AutoMap()
    .Properties(p => p
        .Text(t => t
            .Name(n => n.Conteudo)
            .Analyzer("folding-analyzer")
            .Fields(f => f
                .Text(ss => ss
                    .Name("folding")
                    .Analyzer("folding-analyzer")
                )
            )

(...)

var searchResponse = ElasticClient.Search<Noticia>(s => s
    .Index(indexName)
    .Query(q => q
        .MultiMatch(m => m
            .Fields(f => f
                .Field(p => p.Titulo, 4)
                .Field(p => p.Conteudo.Suffix("folding"), 2)
            )
            .Query(termo)
        )
    )
);

Any clue what I am doing wrong or what I can do to reach my goal?

Thanks a lot in advance!

Upvotes: 0

Views: 456

Answers (1)

Augusto Spinelli

Reputation: 21

After a few days I found out what I was doing wrong, and it was all about the mapping.

Here are the steps I took to track down the problem and eventually solve it:

1 - First, I opened the Kibana console and found that only the last of my mapped fields was being assigned my custom analyzer (folding-analyzer).

To check each of your fields, you can use the get field mapping API with a command like this in Dev Tools:

GET /<index>/_mapping/field/<field>

Then you'll be able to see whether your analyzer is assigned to the field.
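
As a concrete sketch, assuming the index is named intranet and the fields are titulo and conteudo (the names that appear in the question), the check could look like the request below. The field mapping API accepts a comma-separated list, so several fields can be inspected at once.

```
GET /intranet/_mapping/field/titulo,conteudo
```

A field that received the custom analyzer should show "analyzer": "folding-analyzer" in its mapping entry. A text field with no analyzer entry at all is using the default standard analyzer.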

2 - After that, I worked out why only the last field was getting the custom analyzer: I was misusing the fluent mapping in two ways:

  • First, I had to chain my text properties correctly.
  • Second, I was trying to map another POCO class in a separate Map<> clause, when I should have used the Object<> clause.

The mapping that worked for me looked roughly like this:

.Map<Noticia>(mm => mm
    .AutoMap()
    .Properties(p => p
        .Text(t => t
            .Name(n => n.Field1)
            .Analyzer("folding-analyzer")
        )
        .Text(t => t
            .Name(n => n.Field2)
            .Analyzer("folding-analyzer")
        )
        .Object<NoticiaArquivo>(o => o
            .Name(n => n.Arquivos)
            .Properties(eps => eps
                .Text(s => s
                    .Name(e => e.NAField1)
                    .Analyzer("folding-analyzer")
                )
                .Text(s => s
                    .Name(e => e.NAField2)
                    .Analyzer("folding-analyzer")
                )
            )
        )
    )
)
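
One way to confirm that a field really picked up the analyzer after remapping is the analyze API with a field parameter, which runs the text through whatever analyzer that field is mapped to. A sketch, assuming the index is named intranet and the field is conteudo, as in the question:

```
POST /intranet/_analyze
{
  "field": "conteudo",
  "text": "Férias"
}
```

With folding-analyzer applied, the single token should come back as Ferias. If the field silently fell back to the default standard analyzer, it comes back as férias (lowercased, accent kept).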

Finally, note that when you assign an analyzer with the .Analyzer("analyzerName") clause, you're telling Elasticsearch to use that analyzer both at index time and at search time.

If you want to use an analyzer only at search time and not at index time, use the .SearchAnalyzer("analyzerName") clause.
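
In raw mapping terms, these two fluent calls correspond to the analyzer and search_analyzer parameters of a text field. The fragment below is only a sketch: the field name conteudo comes from the question, while index-time-analyzer and search-time-analyzer are hypothetical placeholder names for analyzers that would have to be defined in the index settings.

```
"conteudo": {
  "type": "text",
  "analyzer": "index-time-analyzer",
  "search_analyzer": "search-time-analyzer"
}
```

If search_analyzer is omitted, the analyzer value is used for both indexing and searching, which is the behavior described above.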

Upvotes: 0
