michasaucer

Reputation: 5228

Azure Search: why is OCRed text not merged in the correct order in the merged_content field?

I need to develop my own Web API custom skill that makes use of the Read API. I will use it in my custom skillset. I can't use the built-in OCR skill from Azure Cognitive Search (t

The output of my Web API skill is built like this:

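// logic to get result...
// now build the output of the custom skill
// (needs `using System.Linq;` for Select and Enumerable)

    var textUrlFileResults = results.AnalyzeResult.ReadResults;
    foreach (ReadResult page in textUrlFileResults)
    {
        var newValue = new
        {
            RecordId = value.RecordId,
            Data = new
            {
                // If Lines is null, fall back to an empty sequence;
                // string.Join throws on a null sequence.
                text = string.Join(" ", page.Lines?.Select(x => x.Text) ?? Enumerable.Empty<string>())
            }
        };

        output.Values.Add(newValue);
    }
} // closes an enclosing block that is not shown here

return new OkObjectResult(output);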

And here is my skillset definition:

  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "name": "#1",
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        },
        {
          "name": "itemsToInsert",
          "source": "/document/normalized_images/*/text"
        },
        {
          "name": "offsets",
          "source": "/document/normalized_images/*/contentOffset"
        }
      ],
      "outputs": [
        {
          "name": "mergedText",
          "targetName": "merged_content"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "#2",
      "description": null,
      "context": "/document/normalized_images/*",
      // some properties omitted
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        }
      ]
    }
  ],

I am trying to OCR a PDF document that looks like this: [image]

In the index, I get a document that looks like this:

{
  "@odata.context": " cutted ",
  "value": [
    {
      "@search.score": 1,
      "content": "\nText before shell\n\nText after shell\n\nText after bw\n\n\n\n\n\n\n\nAnd here second page\n\n\n",
      "merged_content": "\nText before shell\n\nText after shell\n\nText after bw\n\n SHELL 1900 1904 1909 1930 1948 SHELL SHELL Shell Shell 1955 1961 1971 1995 1999 \n\n B+W BLACK+WHITE PHOTOGRAPHY \n\n\n\nAnd here second page\n\n\n",
      "text": [
        "SHELL 1900 1904 1909 1930 1948 SHELL SHELL Shell Shell 1955 1961 1971 1995 1999",
        "B+W BLACK+WHITE PHOTOGRAPHY"
      ],
      "layoutText": [],
      "textFromOcr": "[\"SHELL 1900 1904 1909 1930 1948 SHELL SHELL Shell Shell 1955 1961 1971 1995 1999\",\"B+W BLACK+WHITE PHOTOGRAPHY\"]"
    }
  ]
}

My question is: why is the OCRed text not placed in the correct order relative to the standard text when I am using "/document/normalized_images/*/contentOffset" in the MergeSkill? To be honest, my skillset is copy-pasted from the MS docs and it is not working as expected. I don't really understand what special output comes from the OCR skill. I can't use the OCR skill from Search out of the box, so I need to write my own.

Upvotes: 0

Views: 230

Answers (1)

Gia Mondragon - MSFT

Reputation: 466

Unfortunately, that is the behavior of the skill by design. It gets the text first and leaves the image translation at the bottom. This cannot currently be changed with code within the skill because of an implementation limitation. Changes have been made to the OCR skill documentation to reflect this, to clarify the behavior and avoid confusion; they should be published this week, hopefully.
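
To illustrate why the position of the OCR text depends entirely on the offsets, here is a simplified model of offset-based merging. It is only a sketch and not the actual MergeSkill implementation: the class name MergeSketch, the Merge method, the sample text, and the offset values are all made up for illustration. If the contentOffset values point to the end of a page's text (which is what the output in the question suggests), the inserted text ends up after that page's text no matter where the image sat on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class MergeSketch
{
    // Simplified model of offset-based merging. NOT the real MergeSkill code.
    public static string Merge(
        string text,
        IReadOnlyList<string> itemsToInsert,
        IReadOnlyList<int> offsets,
        string preTag = " ",
        string postTag = " ")
    {
        // Visit items by ascending offset. OrderBy is stable, so items that
        // share an offset keep their original relative order.
        var order = Enumerable.Range(0, itemsToInsert.Count).OrderBy(i => offsets[i]);

        var merged = new StringBuilder();
        int cursor = 0;
        foreach (int i in order)
        {
            // Keep the offset inside the bounds of the text.
            int offset = Math.Clamp(offsets[i], 0, text.Length);
            // Copy the original text between the previous offset and this one.
            merged.Append(text, cursor, offset - cursor);
            merged.Append(preTag).Append(itemsToInsert[i]).Append(postTag);
            cursor = offset;
        }
        // Copy whatever original text remains after the last insertion.
        merged.Append(text, cursor, text.Length - cursor);
        return merged.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        string page = "Text before shell\n\nText after shell\n\n";
        string ocr = "SHELL 1900 1904";

        // Hypothetical offset inside the text: the OCR text lands between the paragraphs.
        Console.WriteLine(MergeSketch.Merge(page, new[] { ocr }, new[] { 19 }));

        // Hypothetical offset at the end of the page: the OCR text lands at the bottom.
        Console.WriteLine(MergeSketch.Merge(page, new[] { ocr }, new[] { page.Length }));
    }
}
```

In this model the merge step has no knowledge of the page layout. It only inserts each item at the offset it is given, so the order seen in merged_content is determined by whatever produces contentOffset, not by the custom skill that returns the text.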

Upvotes: 1
