accbear
accbear

Reputation: 23

How to query questions with high similarity based on the input question content?

I have a student exam system in Java. There are more than one million questions in the mysql database. The question content consists of Chinese, English, and latex mathematical formulas.

Now, I want to query questions with identical or very high similarity to the content of newly input questions from teachers.

How to solve this problem?

I want to use nlp to deal with this problem, but I don't know how to start.(gensim Word2Vec Doc2Vec)

I previously stored the question content in postgresql and searched for similar questions through long text. This worked well for Chinese questions with more content, but it did not work well for short English sentences.

Upvotes: 0

Views: 118

Answers (1)

robocat314
robocat314

Reputation: 315

You could use vector databases such as ChromaDB. It is simple to test it out on some smaller subset of data, look at getting started. By default, ChromaDB uses Sentence Transformers (all-MiniLM-L6-v2 model) for embeddings.

Here is a working Python example from documentation:

import chromadb

chroma_client = chromadb.Client()

collection = chroma_client.create_collection(name="my_collection")

collection.add(
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)

results = collection.query(
    query_texts=["This is a query document"],
    n_results=2
)

Upvotes: 0

Related Questions