Noxious Reptile

Reputation: 863

Best approach for semantic similarity in large documents using BERT or LSTM models

I am trying to build a search application for resumes, which are in .pdf format. For a given search query like "who is proficient in Java and worked in an MNC", the output should be the CV that is most similar to the query. My plan is to extract the text from each PDF and compute the cosine similarity between that text and the query.

However, BERT has a problem with long documents. It supports a maximum sequence length of only 512 tokens, but all my CVs have more than 1000 words. I am really stuck here. Methods like truncating the documents don't suit the purpose.

Is there any other model that can do this?

I could not find the right approach with models like Longformer and XLNet for this task.

import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub.
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# documents: dict mapping CV file names to their extracted text.
corpus = list(documents.values())
doc_names = list(documents.keys())

# Embed all documents once, then embed the query.
sentence_embeddings = model(corpus)
query = "who is proficient in C++ and has Rust"
query_vec = model([query.lower()])[0]

results = []
for i, doc_vec in enumerate(sentence_embeddings):
    sim = cosine(query_vec, doc_vec)
    results.append((i, sim))
    # print("Document = ", doc_names[i], "; similarity = ", sim)

# Rank documents by similarity, highest first.
results = sorted(results, key=lambda x: x[1], reverse=True)

for idx, score in results[:5]:
    print(doc_names[idx].strip(), "(Cosine Score: %.4f)" % score)
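
For context, the snippet above assumes a documents dict mapping CV file names to their extracted text. A minimal sketch of how that dict might be built from the PDF files (the use of pypdf and the folder path are my assumptions, not part of the original setup):

import glob
from pypdf import PdfReader

documents = {}
for path in glob.glob("cvs/*.pdf"):  # hypothetical folder containing the resumes
    reader = PdfReader(path)
    # Concatenate the text of all pages into one string per CV,
    # lowercased to match the lowercased query.
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    documents[path] = text.lower()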

Upvotes: 1

Views: 1477

Answers (1)

Joe

Reputation: 909

I advise you to read: Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv preprint arXiv:2004.05150 (2020).

The main contribution of this paper is a model that can take long token sequences as input and capture long-range context across different parts of the document, at a computational cost that grows linearly with sequence length.

Instead of BERT's full self-attention, which limits the input sequence to N = 512 tokens in total, Longformer uses a sliding-window attention with a window of w = 512 tokens, so the model can process much longer inputs (up to 4096 tokens in the released checkpoints).


📍 Longformer: The Long-Document Transformer

GitHub: https://github.com/allenai/longformer

Paper: https://arxiv.org/abs/2004.05150
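
As a rough illustration (not from the paper or from the code in the question), embedding a long CV and a query with the released Longformer checkpoint via Hugging Face transformers could look like the sketch below; the mean-pooling step and the cosine-similarity ranking are my assumptions about how to adapt it to this retrieval task.

import torch
from transformers import LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

def embed(text):
    # Longformer accepts up to 4096 tokens, so a full CV usually fits without truncation.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one document vector
    # (a simple choice, not prescribed by the paper).
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

cv_text = "..."  # placeholder: full resume text extracted from the PDF
query = "who is proficient in Java and worked in an MNC"

cv_vec = embed(cv_text)
query_vec = embed(query)
score = torch.nn.functional.cosine_similarity(cv_vec, query_vec, dim=0)
print("Cosine Score: %.4f" % score.item())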

Upvotes: 1
