Usage of Document Extraction cognitive skill

Question

I am trying to use Azure Cognitive Services to perform basic document extraction.

My intent is to input PDFs and DOCXs (and possibly some other files) into the Cognitive Engine for parsing, but unfortunately, the implementation of this is not as simple as it seems.

According to the documentation (https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-document-extraction#sample-definition), I must define the skill and then I should be able to input files, but there are no examples of how this should be done.

So far I have been able to define the skill, but I am still not sure where I should be dropping the files.

Please see my code below, which tries to replicate the data structure shown in the example code (albeit using the C# library):

public static DocumentExtractionSkill CreateDocumentExtractionSkill()
{
    List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>
    {
        new("file_data") {Source = "/document/file_data"}
    };

    List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>
    {
        new("content") {TargetName = "extracted_content"}
    };

    DocumentExtractionSkill des = new DocumentExtractionSkill(inputMappings, outputMappings)
    {
        Description = "Extract text (plain and structured) from image",
        ParsingMode = BlobIndexerParsingMode.Text,
        DataToExtract = BlobIndexerDataToExtract.ContentAndMetadata,
        Context = "/document",
    };

    return des;
}

And then I build on this skill like so:

_indexerClient = new SearchIndexerClient(new Uri(Environment.GetEnvironmentVariable("SearchEndpoint")), new AzureKeyCredential(Environment.GetEnvironmentVariable("SearchKey")));
List<SearchIndexerSkill> skills = new List<SearchIndexerSkill> { Skills.DocExtractionSkill.CreateDocumentExtractionSkill() };

SearchIndexerSkillset skillset = new SearchIndexerSkillset("DocumentSkillset", skills)
{
    Description = "Document Cracker Skillset",
    CognitiveServicesAccount = new CognitiveServicesAccountKey(Environment.GetEnvironmentVariable("CognitiveServicesKey"))
};


await _indexerClient.CreateOrUpdateSkillsetAsync(skillset);

And... then what?

There is no clear method that fits what I believe is the next stage: actually parsing documents.

What is the next step from here to begin dumping files into the _indexerClient (of type SearchIndexerClient)?

The next stage shown in the documentation is:

{
  "values": [
    {
      "recordId": "1",
      "data":
      {
        "file_data": {
          "$type": "file",
          "data": "aGVsbG8="
        }
      }
    }
  ]
}

It is not clear where I would be doing this.
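
For reference, the "data" value in that sample appears to be the Base64 encoding of the file bytes ("aGVsbG8=" is the Base64 form of the text "hello"). Below is a minimal sketch of how I assume a payload of that shape could be built from a file's bytes in C#. FileDataPayload.Build is a hypothetical helper of my own, not part of the Azure SDK, and this only reproduces the JSON shape; it does not show where the payload would be sent.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

public static class FileDataPayload
{
    // Hypothetical helper (not part of the Azure SDK): wraps raw file bytes in the
    // "values" / "recordId" / "data" / "file_data" shape from the documentation sample.
    public static string Build(string recordId, byte[] fileBytes)
    {
        var payload = new Dictionary<string, object>
        {
            ["values"] = new object[]
            {
                new Dictionary<string, object>
                {
                    ["recordId"] = recordId,
                    ["data"] = new Dictionary<string, object>
                    {
                        ["file_data"] = new Dictionary<string, object>
                        {
                            ["$type"] = "file",
                            // Assumption: "data" holds the Base64-encoded file content.
                            ["data"] = Convert.ToBase64String(fileBytes)
                        }
                    }
                }
            }
        };

        return JsonSerializer.Serialize(payload);
    }

    public static void Main()
    {
        // Uses the same five bytes ("hello") as the documentation sample.
        Console.WriteLine(Build("1", Encoding.UTF8.GetBytes("hello")));
    }
}
```

For a real document, the bytes would presumably come from something like File.ReadAllBytes(path) instead of the "hello" string used here.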
