Reputation: 863
I am trying to build a search application for resumes that are stored as .pdf files. For a given search query like "who is proficient in Java and worked in an MNC", the output should be the CV that is most similar to the query. My plan is to extract the text from each PDF and compute the cosine similarity between the embedding of that text and the embedding of the query.
However, BERT has a problem with long documents. It supports a maximum sequence length of only 512 tokens, but all my CVs have more than 1000 words. I am really stuck here. Methods like truncating the documents don't suit the purpose.
Is there any other model that can do this?
I could not find the right approach with models like Longformer and XLNet for this task.
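import numpy as np
import tensorflow_hub as hub

def cosine(u, v):
    # Cosine similarity: higher means more similar.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

# `documents` is assumed to be a dict mapping file name -> extracted PDF text.
doc_names = list(documents.keys())
corpus = list(documents.values())
sentence_embeddings = model(corpus).numpy()  # embed every document once

query = "who is proficient in C++ and has Rust"
query_vec = model([query.lower()])[0].numpy()

# Score each document against the query, reusing the precomputed embeddings.
results = []
for i, doc_vec in enumerate(sentence_embeddings):
    sim = cosine(query_vec, doc_vec)
    results.append((i, sim))

# Highest similarity first; show the top five matches.
results = sorted(results, key=lambda x: x[1], reverse=True)
for idx, score in results[:5]:
    print(doc_names[idx].strip(), "(Cosine Score: %.4f)" % score)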
Upvotes: 1
Views: 1477
Reputation: 909
I advise you to read: Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv preprint arXiv:2004.05150 (2020).
The main contribution of this paper is a model that accepts long token sequences as input and captures long-range context across the whole document at a computational cost that scales linearly with sequence length.
Here, the sliding-window attention mechanism uses a local window of n = 512 tokens around each position, whereas the BERT model takes N = 512 tokens as its maximum input sequence length. The released Longformer checkpoints accept inputs of up to 4,096 tokens.
📍 Longformer: The Long-Document Transformer
GitHub: https://github.com/allenai/longformer
Paper: https://arxiv.org/abs/2004.05150
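To make the linear-cost claim concrete, here is a small sketch that only counts how many (query, key) pairs each attention pattern has to score. It is not Longformer code: the function names are made up for illustration, and it assumes a window of w tokens means each token attends to at most w // 2 neighbours on each side, as the paper describes.

```python
def full_attention_pairs(n):
    """Pairs scored by full self-attention over n tokens (quadratic in n)."""
    return n * n


def sliding_window_pairs(n, w):
    """Pairs scored when each token attends only to itself and to tokens
    at most w // 2 positions to its left or right (hypothetical helper)."""
    half = w // 2
    total = 0
    for i in range(n):
        lo = max(0, i - half)        # leftmost position token i can see
        hi = min(n - 1, i + half)    # rightmost position token i can see
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    w = 512  # window size mentioned above
    for n in (512, 1024, 2048, 4096):
        print(n, full_attention_pairs(n), sliding_window_pairs(n, w))
```

With w fixed, the sliding-window count is bounded by n * (w + 1), so doubling the document length roughly doubles the work, while the full-attention count quadruples. This ignores the global attention that Longformer adds on a few selected tokens.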
Upvotes: 1