Reputation: 11
We have created an Azure AI Search index that uses the Split skill to chunk our documents page-wise, and a semantic configuration is also set up. Now we want to include the page number, either at the end of the chunk content or as a separate index field. Our index has the following fields: parent_id, title, chunk_id, chunk.
Below are the parameters of the Split skill:
text_split_mode="pages",
context="/document",
maximum_page_length=2048,
page_overlap_length=20,
I tried to extract the page number from the chunk_id: 0ffd977ce09f_aHR0cHM6Ly9hbmh@cmFpbnN0b3JhZ2VhY2NvdW50LmJsb2luS53aW5kb3dzLm5ldC9maWx1dXBsb2FkLXRlc3QtaW5kZXgvQmx1ZV9aZWJyYV9MYW5kbG9yZF9JbnN1cmFuY2VfQwNjaWRlbnRhbF9EYW1hZ2VFUERTXIwMjMwNzAxLnBkZg_pages_2 but when I compared the content of these chunks with the actual document, the page number in the chunk_id was not correct. In the document the content is on page 21, but the chunk shows 26. How can I handle this?
Upvotes: 0
Views: 183
Reputation: 8055
You can use the imageAction configuration to extract the page number and the content of each page. With it you get a field named pageNumber, and the content of each page is written as a separate document to a secondary index.
Below are the sample definitions.
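For context on why the number at the end of chunk_id does not match the PDF page: in "pages" mode the Split skill cuts the extracted text into pieces of at most maximum_page_length characters, so the _pages_N suffix appears to be the ordinal of the chunk, not a page of the source file. A rough sketch of the effect, using a simplified fixed-size splitter (an assumption for illustration; the real skill also tries to respect sentence boundaries) and made-up page lengths:

```python
def split_fixed(text: str, max_len: int = 2048, overlap: int = 20) -> list[str]:
    """Simplified stand-in for the Split skill: fixed-size windows with overlap."""
    step = max_len - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical document: 3 PDF pages of 3000 characters each.
pages = ["x" * 3000 for _ in range(3)]
full_text = "".join(pages)

chunks = split_fixed(full_text)
print(f"PDF pages: {len(pages)}, chunks: {len(chunks)}")
```

With these made-up numbers there are more chunks than pages, so a chunk ordinal drifts ahead of the real page number the further you go into the document.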
Primary index fields.
Skillset definition: the OCR skill extracts the text from each page, and an index projection writes it to the secondary index.
{
  "name": "skillset1",
  "description": "",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "name": "#1",
      "description": "",
      "context": "/document/normalized_images/*",
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        }
      ],
      "defaultLanguageCode": "en",
      "detectOrientation": true,
      "lineEnding": "Space"
    }
  ],
  "@odata.etag": "\"0x8DCFB1DC59EC9ED\"",
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "desidx",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/normalized_images/*",
        "mappings": [
          {
            "name": "text",
            "source": "/document/normalized_images/*/text"
          },
          {
            "name": "pageNumber",
            "source": "/document/normalized_images/*/pageNumber"
          },
          {
            "name": "metadata_storage_path",
            "source": "/document/metadata_storage_name"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
When you set imageAction to generateNormalizedImagePerPage, the data for each page is available under the /document/normalized_images/* context.
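To make the fan-out concrete, here is a small local simulation of what the selector in the skillset does: one child document per entry under normalized_images, with the same three mappings. This is only an illustration of the shape of the data, not the service's implementation; the sample document, the parent key "doc-1" and the helper name project_pages are made up.

```python
from typing import Any


def project_pages(enriched: dict[str, Any], parent_key: str) -> list[dict[str, Any]]:
    """Emit one child document per normalized image, mirroring the selector mappings."""
    children = []
    for image in enriched["normalized_images"]:
        children.append({
            "parent_id": parent_key,               # parentKeyFieldName
            "text": image["text"],                 # OCR skill output
            "pageNumber": image["pageNumber"],     # set by imageAction per page
            "metadata_storage_path": enriched["metadata_storage_name"],
        })
    return children


# Hypothetical enriched document for a two-page PDF.
enriched_doc = {
    "metadata_storage_name": "sample.pdf",
    "normalized_images": [
        {"pageNumber": 1, "text": "text of the first page"},
        {"pageNumber": 2, "text": "text of the second page"},
    ],
}

child_docs = project_pages(enriched_doc, parent_key="doc-1")
for doc in child_docs:
    print(doc["pageNumber"], doc["text"])
```

Each child document carries its own pageNumber, which is what the secondary index stores.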
Indexer
{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DCFB1DD13443EF\"",
  "name": "azureblob-indexer",
  "description": "",
  "dataSourceName": "ds",
  "skillsetName": "skillset1",
  "targetIndexName": "srcidx",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "imageAction": "generateNormalizedImagePerPage"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}
Secondary index fields
Output:
More properties of imageAction are described in the Azure AI Search documentation.
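Once the indexer has run, you can read the page number back by querying the secondary index. A minimal sketch of the request for the Search Documents REST call is below; the service name placeholder, the search text, the api-version value and the helper name build_page_query are assumptions, and the selected fields must be marked retrievable in your index.

```python
import json


def build_page_query(service: str, index: str, search_text: str,
                     api_version: str = "2023-11-01") -> tuple[str, str]:
    """Build the URL and JSON body for a POST to the docs/search endpoint."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={api_version}"
    )
    body = {
        "search": search_text,
        # Field names taken from the projection mappings in the skillset.
        "select": "text,pageNumber,metadata_storage_path",
        "top": 5,
    }
    return url, json.dumps(body)


url, body = build_page_query("<your-service>", "desidx", "accidental damage")
print(url)
print(body)
```

Send the body as a POST with the Content-Type: application/json and api-key headers; each hit then contains the page text together with its pageNumber.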
Upvotes: 0