Reputation: 105
I'm trying to improve the performance of my Cognitive Search setup in combination with OpenAI. I'm currently indexing my documents from SharePoint, and that is where my problem starts. The files are large, so I need to split them so that Cognitive Search returns only the information OpenAI really needs and not the full document. For that I looked into skillsets and found "#Microsoft.Skills.Text.SplitSkill". My index currently looks like this:
{
"name" : "{{index-name}}",
"fields": [
{ "name": "id", "type": "Edm.String", "key": true, "searchable": false },
{ "name": "metadata_spo_item_name", "type": "Edm.String", "key": false, "searchable": true, "filterable": false, "sortable": false, "facetable": false },
{ "name": "metadata_spo_item_path", "type": "Edm.String", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
{ "name": "metadata_spo_item_content_type", "type": "Edm.String", "key": false, "searchable": false, "filterable": true, "sortable": false, "facetable": true },
{ "name": "metadata_spo_item_last_modified", "type": "Edm.DateTimeOffset", "key": false, "searchable": false, "filterable": false, "sortable": true, "facetable": false },
{ "name": "metadata_spo_item_size", "type": "Edm.Int64", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
{ "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
{ "name": "pages", "type": "Collection(Edm.String)", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
]
}
My goal is to use a skillset so that, if the content of a document is too long, it is split into several documents in the index. This is where I'm currently stuck:
{
"name": "{{skillset-name}}",
"description": "SharePoint skillset",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Text.SplitSkill",
"name": "#1",
"description": null,
"context": "/document/id",
"defaultLanguageCode": "en",
"textSplitMode": "pages",
"maximumPageLength": 5000,
"inputs": [
{
"name": "text",
"source": "/document/content"
}
],
"outputs": [
{
"name": "textItems",
"targetName": "pages"
}
]
}
]
}
I tried splitting the content into an array named "pages", but that did not work. The goal is to split the content into several documents that share the same file path.
Upvotes: 3
Views: 1521
Reputation: 31
You need to use index projections to map each split chunk to a separate record in the index: https://learn.microsoft.com/en-us/azure/search/index-projections-concept-intro?tabs=kstore-rest
Add an "indexProjections" section to your skillset definition:
"indexProjections": {
"selectors": [
{
"targetIndexName": "<your index>",
"parentKeyFieldName": "<your key field>",
"sourceContext": "/document/pages/*",
"mappings": [
{
"name": "<your field to put content>",
"source": "/document/pages/*",
"sourceContext": null,
"inputs": []
}
]
}
],
"parameters": {
"projectionMode": "skipIndexingParentDocuments"
}
},
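For reference, here is a rough sketch of how the whole skillset might look with the split skill and the projection combined. This is only an illustration under some assumptions, not a tested definition: "parent_id" and "chunk" are hypothetical field names that you would have to add to your index yourself, and I have assumed the skill context should be "/document" (rather than "/document/id") so that the output lands at "/document/pages", which is the path the projection reads from. The path mapping assumes the SharePoint metadata is available at "/document/metadata_spo_item_path".

```json
{
  "name": "{{skillset-name}}",
  "description": "SharePoint skillset",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#1",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 5000,
      "inputs": [
        { "name": "text", "source": "/document/content" }
      ],
      "outputs": [
        { "name": "textItems", "targetName": "pages" }
      ]
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "{{index-name}}",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          { "name": "chunk", "source": "/document/pages/*" },
          { "name": "metadata_spo_item_path", "source": "/document/metadata_spo_item_path" }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
```

As far as I know, the target index also has to meet a few requirements for projections to work: the key field needs to be searchable with the "keyword" analyzer, and the parent key field ("parent_id" in this sketch) should be an Edm.String field with "filterable" set to true. Check the linked documentation for the exact rules that apply to your API version.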
Upvotes: 3