Jickson

Reputation: 5193

How to build semantic search for a given domain

We are trying to build semantic search over our own data, which is domain-specific (for example, sentences about automobiles).

Our data is just a collection of sentences. Given a phrase, we want to get back the sentences that:

  1. Are similar to that phrase
  2. Contain a part that is similar to the phrase
  3. Have a contextually similar meaning


For example, if I search for the phrase "Buying Experience", I should get sentences like:


I want to emphasize that we are looking for contextual similarity, not just a brute-force word search.

Even if a sentence uses different words, the search should still find it.

Things that we have already tried:

  1. Open Semantic Search: the problem we faced here was generating an ontology from the data we have, or, for that matter, finding an existing ontology for the domains we are interested in.

  2. Elasticsearch (BM25 + tf-idf vectors): this returned a few sentences, but precision was not great and accuracy was poor as well. Evaluated against a human-curated dataset, it retrieved only around 10% of the sentences.

  3. Different embeddings, such as the ones mentioned in sentence-transformers: we went through the example and evaluated against our human-curated set, and that also had very low accuracy.

  4. ELMo: this was better, but accuracy was still lower than we expected, and there is a cognitive load in deciding the cosine value below which we shouldn't consider the sentences. This applies to point 3 as well.

Any help will be appreciated. Thanks a lot in advance.

Upvotes: 28

Views: 7143

Answers (3)

Bob van Luijt

Reputation: 7598

You might be interested in looking into Weaviate to help you solve this problem. It is a smart graph based on the vectorization of data objects.

If you have domain-specific language (e.g., abbreviations) you can extend Weaviate with custom concepts.

You might be able to solve your problem with the semantic search features (i.e., Explore{}) or the automatic classification features.

Explore function

Because all data objects get vectorized, you can do a semantic search like the following (this example comes from the docs, you can try it out here using GraphQL):

{
  Get{
    Things{
      Publication(
        explore: {
          concepts: ["fashion"],
          certainty: 0.7,
          moveAwayFrom: {
            concepts: ["finance"],
            force: 0.45
          },
          moveTo: {
            concepts: ["haute couture"],
            force: 0.85
          }
        }
      ){
        name
      }
    }
  }
}

If you structure your graph schema around, for example, a class named "Sentence", a similar query might look something like this:

{
  Get{
    Things{
      Sentence(
        # Explore (i.e., semantically) for "Buying Experience"
        explore: {
          concepts: ["Buying Experience"]
        }
        # Result must include the word "car" 
        where: {
          operator: Like
          path: ["content"]
          valueString: "*car*"
        }
      ){
        content
      }
    }
  }
}

Note:
You can also explore the graph semantically as a whole.

Automatic classification

An alternative might be working with the contextual or KNN classification features.

In your case, you might use the class Sentence and relate them to a class called Experience, which would have the property: buying (there are of course many other configurations and strategies you can choose from).

PS:
This video gives a bit more context if you like.

Upvotes: 1

Solo987

Reputation: 1

As far as I know, no theoretical model exists for building a semantic search engine. However, I believe a semantic search engine should be designed to cater to the specific requirements at hand. That said, any semantic search engine that successfully understands the intent of the user as well as the context of the search term needs natural language processing (NLP) and machine learning as its building blocks.

Even though search engines work differently from search tools, you can refer to enterprise search tools to get an idea of a semantic search model that works. Newer platforms like 3RDi Search work on the principles of semantic search and are aimed at the unstructured data that enterprises have to deal with. Google is very likely working on a model to introduce advanced semantics in its search engine.

Upvotes: 0

user12909684

Reputation:

I would highly suggest that you watch Trey Grainger's lecture on how to build a semantic search system => https://www.youtube.com/watch?v=4fMZnunTRF8. He talks about the anatomy of a semantic search system and how the individual pieces fit together to deliver a final solution.

A great example of contextual similarity is Bing's search engine (image: Bing search results for the query described below).

The original query had the terms {canned soda}, and Bing's search results can refer to {canned diet soda}, {soft drinks}, {unopened room temperature pop} or {carbonated drinks}. How did Bing do this?

Well, words that have similar meanings get similar vectors, and these vectors can be projected onto a 2-dimensional graph to be easily visualized. The vectors are trained so that words with similar meanings end up near each other in the vector space. You can train your own vector-based model, for example by training a GloVe model.

The closer two vectors are to each other, the more similar their meanings. You can then run nearest-neighbour queries based on the distance between vectors. For example, for the query {how to stop animals from destroying my garden}, the nearest neighbours give these results:

(image: nearest-neighbour results for the query above)
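The nearest-neighbour lookup described above can be sketched as follows. This is only a minimal illustration, assuming each sentence has already been mapped to a vector by some embedding model. The 3-dimensional vectors, the sentences, and the function names are made-up toy values for this example, not output from GloVe or any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def nearest_neighbours(query_vec, sentence_vecs, k=2):
    """Return the k (sentence, similarity) pairs most similar to query_vec."""
    scored = [(s, cosine_similarity(query_vec, v)) for s, v in sentence_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings; a real system would get these from a trained model.
SENTENCE_VECS = {
    "The dealership made purchasing the car painless": [0.9, 0.1, 0.0],
    "The salesman was helpful during checkout": [0.8, 0.3, 0.1],
    "The engine overheats on long drives": [0.1, 0.9, 0.2],
}
QUERY_VEC = [1.0, 0.2, 0.0]  # hypothetical embedding of "Buying Experience"

TOP = nearest_neighbours(QUERY_VEC, SENTENCE_VECS, k=2)
for sentence, score in TOP:
    print(f"{score:.3f}  {sentence}")
```

With these toy values, the two purchase-related sentences rank above the engine sentence even though none of them contains the words "buying" or "experience". A linear scan like this is fine for a few thousand sentences; larger collections usually call for an approximate nearest-neighbour index.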

You can learn more about it here. For your case, you can find a threshold for the maximum distance a sentence vector can be from the search query vector for the sentence to be considered contextually similar.
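One way to take the guesswork out of choosing that threshold is to pick it from a human-curated set, such as the one mentioned in the question. The sketch below is an assumption about how that could be done, not a method from the lecture: it works with cosine similarity (higher is closer) rather than distance, tries every observed score as a cut-off, and keeps the one with the best F1. The sentence ids, scores, and relevance labels are invented for illustration.

```python
def f1_at_threshold(scores, relevant, threshold):
    """F1 when every sentence scoring >= threshold is returned as a match."""
    returned = {s for s, score in scores.items() if score >= threshold}
    true_pos = len(returned & relevant)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(returned)
    recall = true_pos / len(relevant)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, relevant):
    """Try each observed score as a cut-off and keep the one with the highest F1."""
    candidates = sorted(set(scores.values()))
    return max(candidates, key=lambda t: f1_at_threshold(scores, relevant, t))

# Hypothetical similarity scores of five sentences against one query,
# plus the ids a human annotator marked as relevant for that query.
SCORES = {"s1": 0.91, "s2": 0.84, "s3": 0.62, "s4": 0.40, "s5": 0.35}
RELEVANT = {"s1", "s2", "s3"}

THRESHOLD = best_threshold(SCORES, RELEVANT)
print(THRESHOLD, f1_at_threshold(SCORES, RELEVANT, THRESHOLD))
```

In practice you would pool scores over many curated queries and check the chosen cut-off on queries that were held out, since a threshold tuned on one query rarely transfers unchanged to another.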

Contextual similarity can possibly also be achieved by reducing the dimensionality of the vocabulary using something like LSI (Latent Semantic Indexing). To do this in Python, I would highly suggest you check out the gensim library: https://radimrehurek.com/gensim/about.html.

Upvotes: 10
