Reputation: 23
I have a student exam system in Java. There are more than one million questions in the mysql database. The question content consists of Chinese, English, and latex mathematical formulas.
Now, I want to query questions with identical or very high similarity to the content of newly input questions from teachers.
How to solve this problem?
I want to use nlp to deal with this problem, but I don't know how to start.(gensim Word2Vec Doc2Vec)
I previously stored the question content in postgresql and searched for similar questions through long text. This worked well for Chinese questions with more content, but it did not work well for short English sentences.
Upvotes: 0
Views: 118
Reputation: 315
You could use vector databases such as ChromaDB. It is simple to test it out on some smaller subset of data, look at getting started. By default, ChromaDB uses Sentence Transformers (all-MiniLM-L6-v2 model) for embeddings.
Here is a working Python example from documentation:
import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")
collection.add(
documents=["This is a document", "This is another document"],
metadatas=[{"source": "my_source"}, {"source": "my_source"}],
ids=["id1", "id2"]
)
results = collection.query(
query_texts=["This is a query document"],
n_results=2
)
Upvotes: 0