Reputation: 41
I have created an Azure Search resource in the free tier, along with an index and an indexer connected to a Blob Storage resource. The blob storage contains PDF files such as FAQs, policy documents, etc. I have enabled OCR and enrichments, but when I run a search query it just returns the entire content of the PDF files.
My requirement is to retrieve only the chunk of a PDF that answers the query. I am new to Azure and am having difficulty finding documentation that addresses this issue. I understand that enabling semantic search might help, but I want to know whether there is any way to do it without enabling it.
Upvotes: 3
Views: 2679
Reputation: 498
There are two ways to do that. If you want to do it in Azure Search, then hOCR can help you. The hOCR generator is a custom skill that builds on OCR: it takes the output of the OCR skill and generates an hOCR document from it.
https://github.com/Azure-Samples/azure-search-power-skills/blob/main/Vision/HocrGenerator/README.md
In an hOCR document, the recognized text is stored in ordinary text nodes of an HTML file. The split into separate lines and words is given by the surrounding span tags. The usual HTML elements are used as well, for example the p tag for a paragraph. Additional information is given in the properties, such as the class names of the different layout elements:
"ocr_par", "ocr_line", "ocrx_word"
In your case, I think you are looking for "ocr_par" (paragraphs).
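To show what pulling paragraph-level chunks out of hOCR could look like, here is a minimal sketch using only the Python standard library. The `SAMPLE_HOCR` string is hand-written for illustration and is not actual output of the skill, and the names `HocrBlockExtractor` and `extract_blocks` are made up for this example. It assumes the hOCR you get contains elements with the class you ask for, so check the real output of the skill before relying on "ocr_par".

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so they do not disturb depth tracking.
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


class HocrBlockExtractor(HTMLParser):
    """Collects the text of every element carrying a given hOCR class."""

    def __init__(self, target_class="ocr_par"):
        super().__init__()
        self.target_class = target_class
        self.blocks = []   # one string per matching element
        self._depth = 0    # nesting depth inside the current match; 0 means outside
        self._words = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            # Already inside a matching element: just track nesting.
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self._depth = 1
            self._words = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # The matching element just closed: emit its text as one block.
            self.blocks.append(" ".join(self._words))

    def handle_data(self, data):
        if self._depth:
            self._words.extend(data.split())


def extract_blocks(hocr_html, target_class="ocr_par"):
    """Return the text of each element with class `target_class`, in document order."""
    parser = HocrBlockExtractor(target_class)
    parser.feed(hocr_html)
    parser.close()
    return parser.blocks


# Hand-written sample, NOT real skill output.
SAMPLE_HOCR = """
<div class='ocr_page'>
  <p class='ocr_par'>
    <span class='ocr_line'><span class='ocrx_word'>Refunds</span> <span class='ocrx_word'>take</span></span>
    <span class='ocr_line'><span class='ocrx_word'>five</span> <span class='ocrx_word'>days.</span></span>
  </p>
  <p class='ocr_par'>
    <span class='ocr_line'><span class='ocrx_word'>Contact</span> <span class='ocrx_word'>support.</span></span>
  </p>
</div>
"""

paragraphs = extract_blocks(SAMPLE_HOCR)
print(paragraphs)
```

Each extracted block could then be indexed as its own search document, so a query returns the matching paragraph rather than the whole PDF. Passing `target_class="ocr_line"` gives line-level chunks instead.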
The second way is to use Azure Question Answering, which is now part of the Language service: you add your PDFs as a knowledge base and then query it directly.
I personally prefer the second one, which is easier to build.
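As a rough sketch of querying a deployed Question Answering project over REST, the snippet below only builds the request and does not send it. The endpoint host, key, project name ("faq") and deployment name ("production") are placeholders, and the helper `build_query_request` is made up for this example. The path and `api-version` reflect my understanding of the query-knowledgebases API, so please verify them against the current Language service reference.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_query_request(endpoint, api_key, project_name, deployment_name,
                        question, top=3):
    """Build (but do not send) a POST request asking the knowledge base a question."""
    params = urlencode({
        "projectName": project_name,
        "deploymentName": deployment_name,
        "api-version": "2021-10-01",  # assumption: check the current API version
    })
    url = f"{endpoint.rstrip('/')}/language/:query-knowledgebases?{params}"
    body = json.dumps({"question": question, "top": top}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/json",
        },
    )


# Placeholder values, replace with those of your own Language resource.
req = build_query_request(
    endpoint="https://example.cognitiveservices.azure.com/",
    api_key="YOUR-KEY",
    project_name="faq",
    deployment_name="production",
    question="How long do refunds take?",
)
print(req.get_method(), req.full_url)
```

To actually run the query you would pass `req` to `urllib.request.urlopen` and parse the JSON response, which, as far as I know, contains an `answers` list where each entry has the answer text and a confidence score. The official `azure-ai-language-questionanswering` Python package wraps the same call if you prefer an SDK.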
Upvotes: 3