Lee Simpson

Reputation: 309

Approach for extracting relevant text using Azure Cognitive Search

Context: I have a set of documents in SharePoint. I have set up Azure Cognitive Search (Standard tier) with data sources (SharePoint), index and indexers. I have also added a semantic configuration.

Outcome: Ask a question, and have the search find and return relevant sections from the documents. I will use these sections to feed into OpenAI to construct a cohesive result.

I would like to replicate this Microsoft demo: https://www.youtube.com/watch?v=3t3qZu1Dy1k&t=572s It seems to me that in this demo each document's content is very small, so the documents could easily be combined and passed into OpenAI.

My experience so far: The search returns and ranks the documents, which seems OK; however, each result contains only a short 'caption' and the full text. The caption is not necessarily related to my question and therefore cannot be used for the next step, and the full document is far too big to pass to OpenAI.

I have managed to get semantic answers; however, the question has to be extremely precise to get a result, and the associated text is limited.

What I would like: I would like the search to return the sub-sections of a document where the answer to my question may be found. If that is not supported, I feel I need an entirely new approach.

Any ideas? Thanks in advance for your time.

Upvotes: 1

Views: 1676

Answers (1)

Dan Gøran Lunde

Reputation: 5353

The demo you refer to works by feeding documents to Azure Cognitive Search. The query is then formulated as a question, and the Semantic Search functionality returns a set of potential semantic answers extracted from the content in the index.

These potential semantic answers are then fed as a prompt to OpenAI's text completion service: https://beta.openai.com/docs/guides/completion

Screenshot from demo on YouTube
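In code, that flow looks roughly like the sketch below. This is not taken from the demo itself; it assumes the azure-search-documents (11.4+) and the legacy openai (pre-1.0) Python SDKs, and the service, index, semantic configuration, and key names are all placeholders you would replace with your own.

    import openai
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    search_client = SearchClient(
        endpoint="https://<your-service>.search.windows.net",
        index_name="<your-index>",
        credential=AzureKeyCredential("<search-query-key>"),
    )

    question = "What is the forecast for 2022?"

    # Ask for extractive semantic answers alongside the ranked results.
    results = search_client.search(
        search_text=question,
        query_type="semantic",
        semantic_configuration_name="<your-semantic-config>",
        query_answer="extractive",
        query_caption="extractive",
        top=5,
    )

    # Collect the extracted answer snippets (not the full documents).
    answers = [answer.text for answer in (results.get_answers() or [])]

    # Feed the snippets as context to OpenAI's text completion endpoint.
    openai.api_key = "<openai-api-key>"
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(answers) + "\n\n"
        "Question: " + question + "\nAnswer:"
    )
    completion = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=200,
        temperature=0,
    )
    print(completion.choices[0].text.strip())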

First, you must ensure you can get good semantic answers. Inspect the content you have indexed and verify that it contains text that could semantically be an answer to the questions you test with (a quick way to check this is sketched after the examples below). Good content contains declarations of fact, i.e. statements that could be used verbatim as an answer to a question. Examples:

  • The capital of France is Paris.
  • Forecast for 2022 is expected to be 22%.
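To check what your index can actually yield, run a few test questions and print whatever the semantic ranker extracts. This sketch reuses the same placeholder names as above; as far as I know the Python SDK surfaces per-document captions under the "@search.captions" key, but the exact shape may vary by SDK version.

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    # Same placeholder service/index/config names as in the sketch above.
    search_client = SearchClient(
        endpoint="https://<your-service>.search.windows.net",
        index_name="<your-index>",
        credential=AzureKeyCredential("<search-query-key>"),
    )

    for question in ["What is the capital of France?",
                     "What is the forecast for 2022?"]:
        results = search_client.search(
            search_text=question,
            query_type="semantic",
            semantic_configuration_name="<your-semantic-config>",
            query_answer="extractive",
            query_caption="extractive",
            top=3,
        )
        print("Q:", question)
        # Fact-like statements in the indexed content tend to surface here.
        for answer in results.get_answers() or []:
            print("  answer (%.2f): %s" % (answer.score, answer.text))
        for doc in results:
            for caption in doc.get("@search.captions") or []:
                print("  caption:", caption.text)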

The semantic functionality in Azure Search will only respond with a text section containing a potential answer to your question. If you can't get this step to work, you have to improve it, either via the semantic configuration, the choice of content, or by processing your content so that the items in your index contain the relevant text in the correct properties.

  1. Ensure your content is indexed and mapped to properties in a sensible way
  2. Work with the semantic configuration until you get sensible results (a configuration sketch follows this list)
  3. Once the previous two steps are OK, submit the answers to OpenAI
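For step 2, the semantic configuration can be managed in the portal, or in code along the lines of the sketch below. This assumes azure-search-documents 11.4+ (class names differ in older preview versions) and hypothetical field names ("title", "content", "tags"); point the prioritized fields at whatever fields actually hold your prose.

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents.indexes import SearchIndexClient
    from azure.search.documents.indexes.models import (
        SemanticConfiguration,
        SemanticField,
        SemanticPrioritizedFields,
        SemanticSearch,
    )

    index_client = SearchIndexClient(
        endpoint="https://<your-service>.search.windows.net",
        credential=AzureKeyCredential("<admin-api-key>"),
    )
    index = index_client.get_index("<your-index>")

    # Point the semantic ranker at the fields that actually contain the
    # prose you want answers extracted from.
    index.semantic_search = SemanticSearch(
        configurations=[
            SemanticConfiguration(
                name="<your-semantic-config>",
                prioritized_fields=SemanticPrioritizedFields(
                    title_field=SemanticField(field_name="title"),
                    content_fields=[SemanticField(field_name="content")],
                    keywords_fields=[SemanticField(field_name="tags")],
                ),
            )
        ]
    )
    index_client.create_or_update_index(index)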

I have tested semantic search on two different data sets. Both were a combination of website content, PDF and Word documents, etc. The topic and volume of content were essentially the same. From one data set I could get excellent semantic answers, but the other data set was disappointing.

My conclusion was that the content in the good data set was formulated and structured in a way that fits the semantic scenario, while the other data set often presented its logic and meaning through tables and layout. A human reading the content on paper would understand it, but semantically it does not make as much sense.

Upvotes: 0
