Banafshe khazali

Filtering Documents Using Word Embeddings: Keep Job Postings, Exclude Resumes

I have a DataFrame with a column of documents, and I want to remove the documents that resemble resumes while keeping the job postings. To do this, I used a CSV file provided here to compute the similarity between my documents' contents and resumes.

However, my current approach returns both resumes and job postings. I want to keep the job postings and drop the resumes from my DataFrame. This is the function I am using:

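import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def calculate_word_embedding_similarity(dataframe, text_to_compare, column_name='processed_content', embedding_model=None):
    # embedding_model is assumed to behave like gensim KeyedVectors:
    # it supports `word in model`, `model[word]`, and `model.vector_size`.
    def embed(tokens):
        # Sum the vectors of in-vocabulary words; use a zero vector if none match,
        # so that cosine_similarity always receives arrays of the same shape.
        vectors = [embedding_model[word] for word in tokens if word in embedding_model]
        return np.sum(vectors, axis=0) if vectors else np.zeros(embedding_model.vector_size)

    text_vector = embed(text_to_compare.lower().split())
    dataframe_tokens = dataframe[column_name].fillna('').str.lower().str.split()
    dataframe_vectors = [embed(tokens) for tokens in dataframe_tokens]

    # One call compares the reference vector against every document vector.
    cosine_similarities = cosine_similarity([text_vector], dataframe_vectors)[0]

    # Use the same column name for the assignment and the sort
    # (the original assigned 'similarity' but sorted by 'word_embedding_similarity').
    dataframe['word_embedding_similarity'] = cosine_similarities

    dataframe = dataframe.sort_values(by='word_embedding_similarity', ascending=False)

    return dataframe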

Could anyone suggest a method, or a modification to my approach, that would keep only the job postings in my DataFrame and eliminate the resumes?
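
To make the goal concrete: the function above only ranks rows by similarity to one reference text and never removes any of them. A minimal sketch of one possible filter is to score each document against both a resume reference and a job-posting reference and keep the rows that are closer to the job posting. Everything here is illustrative: `TOY_EMBEDDINGS`, `mean_vector`, `cosine`, `keep_job_postings`, and the two reference texts are hypothetical names and data, and the 2-dimensional vectors stand in for a real embedding model.

```python
import numpy as np
import pandas as pd

# Toy 2-d embeddings, purely illustrative; a real embedding model would replace this dict.
TOY_EMBEDDINGS = {
    "experience": np.array([1.0, 0.2]),
    "skills": np.array([0.9, 0.1]),
    "education": np.array([1.0, 0.0]),
    "hiring": np.array([0.1, 1.0]),
    "apply": np.array([0.0, 0.9]),
    "salary": np.array([0.2, 1.0]),
}

def mean_vector(text, embeddings, dim=2):
    """Average the vectors of known words; return a zero vector if none are known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine(a, b):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def keep_job_postings(df, resume_text, job_text, embeddings, column="processed_content"):
    """Keep rows that are closer to the job-posting reference than to the resume reference."""
    resume_vec = mean_vector(resume_text, embeddings)
    job_vec = mean_vector(job_text, embeddings)
    doc_vecs = df[column].fillna("").map(lambda t: mean_vector(t, embeddings))
    scored = df.assign(
        resume_similarity=doc_vecs.map(lambda v: cosine(v, resume_vec)),
        job_similarity=doc_vecs.map(lambda v: cosine(v, job_vec)),
    )
    # Boolean mask: this is the step that actually drops rows, unlike sort_values.
    return scored[scored["job_similarity"] > scored["resume_similarity"]]

docs = pd.DataFrame({"processed_content": [
    "experience skills education",             # resume-like
    "hiring apply salary",                     # posting-like
    "skills education experience experience",  # resume-like
    "apply hiring",                            # posting-like
]})
filtered = keep_job_postings(
    docs, "experience skills education", "hiring apply salary", TOY_EMBEDDINGS
)
print(filtered)
```

Comparing two scores avoids having to pick an absolute similarity cutoff, but it assumes representative reference text exists for both classes; with real data, the reference texts and the comparison rule would likely need tuning.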

Upvotes: 1

Views: 96

Answers (0)