Hakuna0001
Reputation: 11

How to get page number from azure ai search index which is using split skillset?

We have created an Azure AI Search index that uses a Split skill in a skillset to chunk our documents page-wise, and a semantic configuration is also set up. Now we want to include the page number either at the end of the chunk content or as a separate index field. Our index has the following fields: parent_id, title, chunk_id, chunk.

Below are the parameters of the split skill:

text_split_mode="pages",  
context="/document",  
maximum_page_length=2048,  
page_overlap_length=20,  

I tried to extract the page number from the chunk_id: 0ffd977ce09f_aHR0cHM6Ly9hbmh@cmFpbnN0b3JhZ2VhY2NvdW50LmJsb2luS53aW5kb3dzLm5ldC9maWx1dXBsb2FkLXRlc3QtaW5kZXgvQmx1ZV9aZWJyYV9MYW5kbG9yZF9JbnN1cmFuY2VfQwNjaWRlbnRhbF9EYW1hZ2VFUERTXIwMjMwNzAxLnBkZg_pages_2 but when I compared the content of these chunks with the actual document, the page number in the chunk ID was not correct. In the document the content is on page 21, but the chunk ID shows 26. How can I handle this?
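For reference, the trailing number can be pulled out of a chunk_id as in the sketch below. It assumes the key always ends in `_pages_<N>`. Note that with text_split_mode="pages" the Split skill cuts text by character length (maximum_page_length), so N is most likely the position of the chunk in the split output rather than a PDF page number, which would explain a mismatch such as 26 versus 21. The shortened key in the example is hypothetical.

```python
import re

# Assumption: the projected key ends with "_pages_<N>", where N is the
# position of the chunk in the Split skill output, not a PDF page number.
CHUNK_SUFFIX = re.compile(r"_pages_(\d+)$")


def chunk_index(chunk_id: str) -> int:
    """Return the trailing chunk index from a chunk_id, or raise ValueError."""
    match = CHUNK_SUFFIX.search(chunk_id)
    if match is None:
        raise ValueError(f"no _pages_<N> suffix in {chunk_id!r}")
    return int(match.group(1))


# Hypothetical, shortened key used only for illustration.
print(chunk_index("0ffd977ce09f_aHR0cHM6_pages_26"))
```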

Upvotes: 0

Views: 183

Answers (1)

JayashankarGS
Reputation: 8055

You can use the imageAction indexer configuration to extract the page number and the content of each page.

With it you get a field named pageNumber, and the content of each page is stored as a separate document in a secondary index.

Below are sample definitions.

Primary index fields.

[Image: primary index fields]

Skillset definition: the OCR skill extracts the text from each page, and an index projection writes the results to the secondary index.

{
  "name": "skillset1",
  "description": "",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "name": "#1",
      "description": "",
      "context": "/document/normalized_images/*",
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        }
      ],
      "defaultLanguageCode": "en",
      "detectOrientation": true,
      "lineEnding": "Space"
    }
  ],
  "@odata.etag": "\"0x8DCFB1DC59EC9ED\"",
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "desidx",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/normalized_images/*",
        "mappings": [
          {
            "name": "text",
            "source": "/document/normalized_images/*/text"
          },
          {
            "name": "pageNumber",
            "source": "/document/normalized_images/*/pageNumber"
          },
          {
            "name": "metadata_storage_path",
            "source": "/document/metadata_storage_name"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}

Whenever you set imageAction to generateNormalizedImagePerPage, the data for each page is available under the /document/normalized_images/* context.
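To make the fan-out concrete, here is a minimal Python sketch of what the projection does conceptually: one child document per entry in normalized_images. It is only an illustration, not how the service is implemented; the function name `project_pages` and the sample document are made up, while the field names follow the mappings in the skillset above.

```python
from typing import Any


def project_pages(parent_id: str, enriched: dict[str, Any]) -> list[dict[str, Any]]:
    """Emit one child document per page found under normalized_images."""
    children = []
    for image in enriched.get("normalized_images", []):
        children.append({
            "parent_id": parent_id,
            # Output of the OCR skill for this page.
            "text": image.get("text", ""),
            # Page number attached to the page image by the indexer.
            "pageNumber": image["pageNumber"],
            # Same mapping as in the skillset: the field is filled from the file name.
            "metadata_storage_path": enriched.get("metadata_storage_name"),
        })
    return children


# Hypothetical enriched document with two pages.
sample = {
    "metadata_storage_name": "policy.pdf",
    "normalized_images": [
        {"pageNumber": 1, "text": "first page"},
        {"pageNumber": 2, "text": "second page"},
    ],
}

for doc in project_pages("doc-1", sample):
    print(doc["pageNumber"], doc["text"])
```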

Indexer

{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DCFB1DD13443EF\"",
  "name": "azureblob-indexer",
  "description": "",
  "dataSourceName": "ds",
  "skillsetName": "skillset1",
  "targetIndexName": "srcidx",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "imageAction": "generateNormalizedImagePerPage"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}
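The three configuration values that matter here are imageAction, dataToExtract and parsingMode. As far as I know, image extraction needs dataToExtract set to contentAndMetadata and parsingMode set to default. A small Python check like the one below can catch a wrong setting before you run the indexer; `check_image_config` is a made-up helper, not part of any SDK.

```python
def check_image_config(indexer: dict) -> list[str]:
    """Return a list of problems found in the indexer's image settings."""
    # "parameters" can be null in a definition returned by the service.
    config = (indexer.get("parameters") or {}).get("configuration") or {}
    problems = []
    if config.get("imageAction") != "generateNormalizedImagePerPage":
        problems.append("imageAction should be generateNormalizedImagePerPage")
    if config.get("dataToExtract") != "contentAndMetadata":
        problems.append("dataToExtract should be contentAndMetadata")
    # Assumption: a missing parsingMode falls back to "default".
    if config.get("parsingMode", "default") != "default":
        problems.append("parsingMode should be default")
    return problems


# Same configuration block as in the indexer definition above.
indexer = {
    "name": "azureblob-indexer",
    "parameters": {
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "parsingMode": "default",
            "imageAction": "generateNormalizedImagePerPage",
        }
    },
}
print(check_image_config(indexer))
```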

Secondary index fields

[Image: secondary index fields]

Output:

[Image: query output from the secondary index]
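Once the secondary index is populated, you could look up a given page with a filter on pageNumber. The Python sketch below only builds the body for a POST to the index's docs/search endpoint and does not send it. It assumes pageNumber is marked filterable in desidx and that numbering starts at 1 for pages rendered from a PDF; `build_page_query` and the search text are hypothetical.

```python
import json


def build_page_query(search_text: str, page_number: int, top: int = 5) -> dict:
    """Build a search request body restricted to one page number."""
    if page_number < 0:
        raise ValueError("page_number must not be negative")
    return {
        "search": search_text,
        # OData filter; requires pageNumber to be filterable (assumption).
        "filter": f"pageNumber eq {page_number}",
        # Fields defined by the projection mappings in the skillset.
        "select": "text,pageNumber,parent_id",
        "top": top,
    }


# Hypothetical search text, restricted to page 21.
body = build_page_query("accidental damage", 21)
print(json.dumps(body, indent=2))
```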

More properties of imageAction are described in the Azure AI Search documentation.

Upvotes: 0