Bitz

Reputation: 1148

Usage of Document Extraction cognitive skill

I am trying to use Azure Cognitive Services to perform basic document extraction.

My intent is to input PDFs and DOCXs (and possibly some other files) into the Cognitive Engine for parsing, but unfortunately, the implementation of this is not as simple as it seems.

According to the documentation (https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-document-extraction#sample-definition), I must define the skill and then I should be able to input files, but there are no examples of how this should be done.

So far I have been able to define the skill, but I am still not sure where I should be dropping the files.

Please see my code below, which seeks to replicate the data structure shown in the example code (albeit using the C# library):

public static DocumentExtractionSkill CreateDocumentExtractionSkill()
{
    List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>
    {
        new("file_data") {Source = "/document/file_data"}
    };

    List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>
    {
        new("content") {TargetName = "extracted_content"}
    };

    DocumentExtractionSkill des = new DocumentExtractionSkill(inputMappings, outputMappings)
    {
        Description = "Extract text (plain and structured) from image",
        ParsingMode = BlobIndexerParsingMode.Text,
        DataToExtract = BlobIndexerDataToExtract.ContentAndMetadata,
        Context = "/document",
    };

    return des;
}

And then I build on this skill like so:

_indexerClient = new SearchIndexerClient(new Uri(Environment.GetEnvironmentVariable("SearchEndpoint")), new AzureKeyCredential(Environment.GetEnvironmentVariable("SearchKey")));
List<SearchIndexerSkill> skills = new List<SearchIndexerSkill> { Skills.DocExtractionSkill.CreateDocumentExtractionSkill() };

SearchIndexerSkillset skillset = new SearchIndexerSkillset("DocumentSkillset", skills)
{
    Description = "Document Cracker Skillset",
    CognitiveServicesAccount = new CognitiveServicesAccountKey(Environment.GetEnvironmentVariable("CognitiveServicesKey"))
};


await _indexerClient.CreateOrUpdateSkillsetAsync(skillset);

And... then what?

I cannot find a method that fits what I believe is the next stage: actually parsing documents.

What is the next step from here to begin feeding files to the _indexerClient (of type SearchIndexerClient)?

The next stage shown in the documentation is this sample input:

{
  "values": [
    {
      "recordId": "1",
      "data":
      {
        "file_data": {
          "$type": "file",
          "data": "aGVsbG8="
        }
      }
    }
  ]
}

It is not clear where I would be sending this.

Upvotes: 2

Views: 1944

Answers (1)

SwethaKandikonda

Reputation: 8244

According to the documentation you linked, the sample is exercised through Postman: they use a GET method, sending a JSON request to the mentioned URL (i.e. the Cognitive skill URL), and receive the extracted document content in the response. The files/documents need to be uploaded to your storage account in order to be extracted.

You can follow this tutorial to get more insights.
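To make the storage account point concrete, here is a minimal sketch of how the skillset from the question might be wired to a blob container with the same Azure.Search.Documents library. It assumes the files are already uploaded to a blob container and that a search index with a string field named "content" already exists. The class name, the data source and indexer names, and that "content" field are hypothetical placeholders, not anything taken from the documentation. The skill reads /document/file_data, which is why AllowSkillsetToReadFileData is switched on.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

public static class DocumentIndexerSetup
{
    // Hypothetical resource names; replace them with your own.
    public const string DataSourceName = "documents-datasource";
    public const string IndexerName = "documents-indexer";

    // Describes the blob container that holds the uploaded PDFs/DOCXs.
    public static SearchIndexerDataSourceConnection BuildDataSource(
        string storageConnectionString, string containerName)
    {
        return new SearchIndexerDataSourceConnection(
            DataSourceName,
            SearchIndexerDataSourceType.AzureBlob,
            storageConnectionString,
            new SearchIndexerDataContainer(containerName));
    }

    // Describes an indexer that pulls blobs from the data source, runs the
    // skillset on each one, and writes the result into the target index.
    public static SearchIndexer BuildIndexer(string targetIndexName, string skillsetName)
    {
        var indexer = new SearchIndexer(IndexerName, DataSourceName, targetIndexName)
        {
            SkillsetName = skillsetName,
            Parameters = new IndexingParameters
            {
                IndexingParametersConfiguration = new IndexingParametersConfiguration
                {
                    // Exposes /document/file_data to the skillset.
                    AllowSkillsetToReadFileData = true
                }
            }
        };

        // The skill in the question writes to /document/extracted_content.
        // "content" is an assumed field name in the target index.
        indexer.OutputFieldMappings.Add(
            new FieldMapping("/document/extracted_content") { TargetFieldName = "content" });

        return indexer;
    }

    // Sends both definitions to the search service.
    public static async Task CreateAsync(
        SearchIndexerClient client,
        string storageConnectionString,
        string containerName,
        string targetIndexName,
        string skillsetName)
    {
        await client.CreateOrUpdateDataSourceConnectionAsync(
            BuildDataSource(storageConnectionString, containerName));
        await client.CreateOrUpdateIndexerAsync(
            BuildIndexer(targetIndexName, skillsetName));
    }
}
```

With the question's code, the call would look something like await DocumentIndexerSetup.CreateAsync(_indexerClient, storageConnectionString, "documents", "documents-index", "DocumentSkillset"), where the connection string, container name and index name are again placeholders. As far as I know a newly created indexer starts running on its own, so the extracted text should show up in the index without pushing files through the client directly.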

Upvotes: 1
