azure search index html content

Question

I know blob storage is the only data source (thus far) that supports indexing HTML content.

My question: is it possible to strip HTML from content using a custom analyzer with the `html_strip` char filter (mentioned in the Azure docs) when adding a document to an index via REST?

Here is my create index payload:

    {
      "name": "htmlindex",
      "fields": [
        {"name": "id", "type": "Edm.String", "key": true, "searchable": false},
        {"name": "title", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": true},
        {"name": "html", "type": "Collection(Edm.String)", "analyzer": "htmlAnalyzer"}
      ],
      "analyzers": [
        {
          "name": "htmlAnalyzer",
          "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
          "charFilters": [ "html_strip" ],
          "tokenizer": "standard_v2"
        }
      ]
    }

Here is my add document to index payload:

    {
      "value": [
        {
          "id": "1",
          "title": "title1",
          "html": [
            "test1",
            "test2"
          ]
        }
      ]
    }

Now when I search the index, I see that the HTML content is not being stripped:

    {
      "@odata.context": "https://deviqfy.search.windows.net/indexes('htmlindex')/$metadata#docs",
      "value": [
          {
              "@search.score": 1,
              "id": "1",
              "title": "title1",
              "html": [
                  "test1",
                  "test2"
              ]
          }
      ]
    }

What am I doing wrong? How can I have the HTML stripped from the content when I add it, without a separate pre-processing step?
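One thing that may be worth checking separately: as far as I understand, an analyzer only changes the tokens that get indexed, not the stored field value that a search returns, so the search response alone may not show whether `html_strip` ran. Below is a minimal sketch of how the analyzer's token output could be inspected through the Analyze Text endpoint (`/indexes/{index}/analyze`). The API version, the API key placeholder, the sample text, and the helper name `build_analyze_request` are assumptions for illustration, not part of my actual setup.

```python
import json
import urllib.request

SERVICE = "deviqfy"          # service name as it appears in the search response above
INDEX = "htmlindex"          # index name from the create-index payload
API_VERSION = "2020-06-30"   # assumption: substitute the API version you actually use
API_KEY = "<admin-api-key>"  # placeholder, not a real key


def build_analyze_request(text, analyzer="htmlAnalyzer"):
    """Build (but do not send) a request asking the service to analyze `text`."""
    url = (
        f"https://{SERVICE}.search.windows.net"
        f"/indexes/{INDEX}/analyze?api-version={API_VERSION}"
    )
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": API_KEY},
    )


if __name__ == "__main__":
    # Hypothetical sample input; sending it requires a valid key and network access.
    req = build_analyze_request("<p>test1</p>")
    with urllib.request.urlopen(req) as resp:
        for token in json.load(resp)["tokens"]:
            print(token["token"])
```

If the printed tokens contain no markup, the char filter is being applied at analysis time, and the unstripped value in the search results is just the stored original.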
