Sam Grimmer

Reputation: 65

Azure Search: index HTML content

I know blob storage is the only data source (thus far) that supports indexing HTML content.

My question: is it possible to strip HTML from content using a custom analyzer with the 'html_strip' char filter (mentioned in the Azure docs) before a document is added to an index via REST?

Here is my create index payload:

    {
      "name": "htmlindex",  
      "fields": [
      {"name": "id", "type": "Edm.String", "key": true, "searchable": false},
      {"name": "title", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": true},
      {"name": "html", "type": "Collection(Edm.String)", "analyzer": "htmlAnalyzer"}
      ],
      "analyzers": [
      {
        "name": "htmlAnalyzer",
        "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
        "charFilters": [ "html_strip" ],
        "tokenizer": "standard_v2"
      }
      ]
    }

Here is my add document to index payload:

    {
      "value": [
        {
          "id": "1",
          "title": "title1",
          "html": [
            "<p>test1</p>",
            "<p>test2</p>"
          ]
        }
      ]
    }

Now when I search the index, I see that the HTML content is not being stripped:

    {
      "@odata.context": "https://deviqfy.search.windows.net/indexes('htmlindex')/$metadata#docs",
      "value": [
          {
              "@search.score": 1,
              "id": "1",
              "title": "title1",
              "html": [
                  "<p>test1</p>",
                  "<p>test2</p>"
              ]
          }
      ]
    }

What am I doing wrong? How can I strip the HTML from the content before I add it, without a separate pre-processing step?

Upvotes: 2

Views: 1298

Answers (1)

Arvind - MSFT

Reputation: 569

Custom analyzers let you control how text is processed for full-text search. Their character filters (such as html_strip) are optional steps that run before the text is tokenized.

Azure Search has no mechanism for modifying the contents of a document when you push it to your index through the REST API. Analyzers only determine which terms are extracted from a field for matching. The field value itself is stored and returned exactly as you sent it, which is why the tags still appear in your search results. If you want the stored content to be free of HTML, you will have to strip it yourself before uploading.

More details here if you are interested: https://learn.microsoft.com/en-us/azure/search/search-lucene-query-architecture
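
Since the stripping has to happen on the client, here is a minimal sketch of one way to do it in Python before building the upload payload. It uses only the standard library `html.parser` module. The helper names `strip_html` and `build_payload` are hypothetical, made up for this illustration and not part of any Azure SDK, and the sketch only builds the JSON body; sending it to your service is left out.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards the tags around them."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join text nodes with a space, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(fragment):
    """Hypothetical helper: return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


def build_payload(docs):
    """Hypothetical helper: build the upload body with 'html' values stripped."""
    cleaned = [
        {**doc, "html": [strip_html(h) for h in doc["html"]]}
        for doc in docs
    ]
    return {"value": cleaned}


payload = build_payload([
    {"id": "1", "title": "title1", "html": ["<p>test1</p>", "<p>test2</p>"]}
])
print(json.dumps(payload, indent=2))
```

Note that this simple extractor also keeps the text inside `<script>` and `<style>` elements and inserts a space between adjacent text nodes, so it may need adjusting for real pages. If you want to return the original markup as well as search the clean text, one option is to store the raw HTML in a separate, non-searchable field.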

Upvotes: 1
