KevJo

Reputation: 105

Splitting text with skillset in Azure

I'm trying to improve the performance of my Cognitive Search in combination with OpenAI. I'm currently indexing my documents from SharePoint, and that is where my problem starts. The files are large, so I need to split them so that Cognitive Search passes only the relevant information to OpenAI and not the full document. For that I looked into skillsets and found "#Microsoft.Skills.Text.SplitSkill". My index currently looks like this:

{
    "name" : "{{index-name}}",
    "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "metadata_spo_item_name", "type": "Edm.String", "key": false, "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_path", "type": "Edm.String", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_content_type", "type": "Edm.String", "key": false, "searchable": false, "filterable": true, "sortable": false, "facetable": true },
        { "name": "metadata_spo_item_last_modified", "type": "Edm.DateTimeOffset", "key": false, "searchable": false, "filterable": false, "sortable": true, "facetable": false },
        { "name": "metadata_spo_item_size", "type": "Edm.Int64", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "pages", "type": "Collection(Edm.String)", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
    ]
}

My goal is to use a skillset so that, if the content of a document in my index is too long, it is split into several documents. This is where I'm currently stuck:

{
    "name": "{{skillset-name}}",
    "description": "SharePoint skillset",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "name": "#1",
            "description": null,
            "context": "/document/id",
            "defaultLanguageCode": "en",
            "textSplitMode": "pages",
            "maximumPageLength": 5000,
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/content"
                }
            ],
            "outputs": [
                {
                    "name": "textItems",
                    "targetName": "pages"
                }
            ]
        }
    ]
}

Splitting my content into an array named "pages" did not work. The goal is to split the content into several documents with the same file path.

Upvotes: 3

Views: 1521

Answers (1)

Zepsen Demetriy

Reputation: 31

You need to use index projections in your skillset to map each split chunk to a separate record in the index: https://learn.microsoft.com/en-us/azure/search/index-projections-concept-intro?tabs=kstore-rest

"indexProjections": {
    "selectors": [
        {
            "targetIndexName": "<your index>",
            "parentKeyFieldName": "<your key field>",
            "sourceContext": "/document/pages/*",
            "mappings": [
                {
                    "name": "<your field to put content>",
                    "source": "/document/pages/*",
                    "sourceContext": null,
                    "inputs": []
                }
            ]
        }
    ],
    "parameters": {
        "projectionMode": "skipIndexingParentDocuments"
    }
},
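
For context, here is a rough sketch of how the fragment above might sit inside the skillset from the question. It is only an illustration under a few assumptions: the skill context is changed to "/document" so that the output lands at /document/pages (which is what the projection's sourceContext expects), each chunk is written to the existing "content" field, and "parent_id" is a hypothetical field name that would have to be added to the index. The second mapping copies the SharePoint path onto every chunk so all chunks keep the same file path.

```json
{
    "name": "{{skillset-name}}",
    "description": "SharePoint skillset",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "name": "#1",
            "context": "/document",
            "defaultLanguageCode": "en",
            "textSplitMode": "pages",
            "maximumPageLength": 5000,
            "inputs": [
                { "name": "text", "source": "/document/content" }
            ],
            "outputs": [
                { "name": "textItems", "targetName": "pages" }
            ]
        }
    ],
    "indexProjections": {
        "selectors": [
            {
                "targetIndexName": "{{index-name}}",
                "parentKeyFieldName": "parent_id",
                "sourceContext": "/document/pages/*",
                "mappings": [
                    { "name": "content", "source": "/document/pages/*" },
                    { "name": "metadata_spo_item_path", "source": "/document/metadata_spo_item_path" }
                ]
            }
        ],
        "parameters": {
            "projectionMode": "skipIndexingParentDocuments"
        }
    }
}
```

As far as I can tell from the linked documentation, the target index also needs some changes for this to work: the key field ("id" in the question) should be searchable and use the keyword analyzer, and the parent key field (here the hypothetical "parent_id") should be a filterable Edm.String. Index projections also require an API version that supports them, so check that against the docs before relying on this sketch.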

Upvotes: 3
