Azure Search: why is OCRed text not merged in the correct order in the merged_content field?

Question

I need to develop my own Web API custom skill that makes use of the Read API, and I will use it in my custom skillset. I can't use the built-in OCR skill from Azure Cognitive Search (t

The output of my Web API skill is built like this:

// logic to get result...
// now creating output for the custom skill

    var textUrlFileResults = results.AnalyzeResult.ReadResults;
    foreach (ReadResult page in textUrlFileResults)
    {
        var newValue = new
        {
            RecordId = value.RecordId,
            Data = new
            {
                // Lines may be null for a page without text; fall back to an empty
                // sequence so string.Join does not throw ArgumentNullException.
                text = string.Join(" ", page.Lines?.Select(x => x.Text) ?? Enumerable.Empty<string>())
            }
        };

        output.Values.Add(newValue);
    }
}

return new OkObjectResult(output);
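
For comparison, a custom Web API skill is expected to return a `values` array in which each entry carries the `recordId` of the input record it answers, plus a `data` object holding the outputs. The following is a minimal sketch of that shape only. `SkillResponseSketch`, `BuildResponse`, and the tuple-based input are hypothetical stand-ins for the Read API results, not part of any SDK, and the sketch assumes that all pages of one record should be joined into a single `text` output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public static class SkillResponseSketch
{
    // records: one entry per input record; Pages holds the OCR lines of each page.
    // This is a hypothetical stand-in for results.AnalyzeResult.ReadResults.
    public static string BuildResponse(
        IEnumerable<(string RecordId, List<List<string>> Pages)> records)
    {
        var values = records.Select(r => new
        {
            // Echo the recordId of the input record this entry answers.
            recordId = r.RecordId,
            data = new
            {
                // Join the lines of all pages; a null page contributes nothing.
                text = string.Join(" ",
                    r.Pages.SelectMany(lines => lines ?? Enumerable.Empty<string>()))
            }
        }).ToList();

        return JsonSerializer.Serialize(new { values });
    }

    public static void Main()
    {
        var records = new[]
        {
            ("1", new List<List<string>> { new List<string> { "B+W", "BLACK+WHITE PHOTOGRAPHY" } })
        };
        Console.WriteLine(BuildResponse(records));
    }
}
```

Emitting one entry per input record, rather than one per OCR page, keeps the response aligned with the records the indexer sent.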

And here is my skillset definition:

  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "name": "#1",
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        },
        {
          "name": "itemsToInsert",
          "source": "/document/normalized_images/*/text"
        },
        {
          "name": "offsets",
          "source": "/document/normalized_images/*/contentOffset"
        }
      ],
      "outputs": [
        {
          "name": "mergedText",
          "targetName": "merged_content"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "#2",
      "description": null,
      "context": "/document/normalized_images/*",
      // i cut some info
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        }
      ]
    }
  ],

I am trying to OCR a PDF document that looks like this:

And in the index I get a document that looks like this:

{
  "@odata.context": " cutted ",
  "value": [
    {
      "@search.score": 1,
      "content": "
Text before shell

Text after shell

Text after bw







And here second page


",
      "merged_content": "
Text before shell

Text after shell

Text after bw

 SHELL 1900 1904 1909 1930 1948 SHELL SHELL Shell Shell 1955 1961 1971 1995 1999 

 B+W BLACK+WHITE PHOTOGRAPHY 



And here second page


",
      "text": [
        "SHELL 1900 1904 1909 1930 1948 SHELL SHELL Shell Shell 1955 1961 1971 1995 1999",
        "B+W BLACK+WHITE PHOTOGRAPHY"
      ],
      "layoutText": [],
      "textFromOcr": "[\"SHELL 1900 1904 1909 1930 1948 SHELL SHELL Shell Shell 1955 1961 1971 1995 1999\",\"B+W BLACK+WHITE PHOTOGRAPHY\"]"
    }
  ]
}

My question is: why is the OCRed text not placed in the correct order relative to the standard text when I am using "/document/normalized_images/*/contentOffset" in the MergeSkill? To be honest, my skillset is copy-pasted from the MS docs and it is not working as expected. I don't really understand what is special about the output of the OCR skill. I can't use the OCR skill from Search out of the box; I need to write my own.

Gia Mondragon - MSFT · Accepted Answer

Unfortunately, that is the behavior of the skill by design. It places the document text first and leaves the text extracted from the images at the bottom. This cannot currently be changed from code within the skill because of an implementation limitation. The OCR skill documentation has been updated to reflect this, and the changes will hopefully be published this week to clarify the behavior and avoid confusion.
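
The effect described above can be reproduced with a simplified model of offset-based merging. The sketch below is not the MergeSkill implementation: `MergeSketch.Merge` and the sample strings are hypothetical. It assumes, consistent with the indexed document in the question, that every image extracted from a PDF page reports a contentOffset equal to the end of that page's text, and that all offsets fall within the text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class MergeSketch
{
    // Simplified model: insert each item, wrapped in the pre/post tags,
    // at its character offset in the original text.
    public static string Merge(string text, IReadOnlyList<string> items,
        IReadOnlyList<int> offsets, string preTag = " ", string postTag = " ")
    {
        var sb = new StringBuilder(text);

        // Insert from the highest offset down so earlier insertions do not shift
        // the remaining offsets. For equal offsets, the later item goes in first,
        // which leaves the items in their original order in the result.
        var order = Enumerable.Range(0, items.Count)
            .OrderByDescending(i => offsets[i])
            .ThenByDescending(i => i);

        foreach (var i in order)
        {
            sb.Insert(offsets[i], preTag + items[i] + postTag);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string page1 = "Text before shell\nText after shell\nText after bw\n";
        string page2 = "And here second page\n";

        // Assumption: both images on page 1 report the end of page 1 as their offset.
        int endOfPage1 = page1.Length;
        var items = new[] { "SHELL 1900 1904", "B+W BLACK+WHITE PHOTOGRAPHY" };
        var offsets = new[] { endOfPage1, endOfPage1 };

        Console.WriteLine(Merge(page1 + page2, items, offsets));
    }
}
```

With both offsets pointing at the end of the first page, the OCR strings land after "Text after bw" and before the second page's text, which matches the merged_content shown in the question. Under this model, the merge step itself has no information about where on the page each image originally sat.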
