Fatemeh

Reputation: 3

Extracting only technical keywords from a text using the RAKE library in Python

I want to use RAKE to extract technical keywords from a job description that I found on LinkedIn, which looks like this:

text = """In-depth understanding of the Python software development stacks, ecosystems, frameworks and tools such as Numpy, Scipy, Pandas, Dask, spaCy, NLTK, sci-kit-learn and PyTorch.Experience with front-end development using HTML, CSS, and JavaScript.
Familiarity with database technologies such as SQL and NoSQL.Excellent problem-solving ability with solid communication and collaboration skills.
Preferred Skills And QualificationsExperience with popular Python frameworks such as Django, Flask or Pyramid."""

I ran this code, expecting it to return the keywords:

from rake_nltk import Rake

r = Rake()
r.extract_keywords_from_text(text)
keywords = r.get_ranked_phrases_with_scores()

for score, keyword in keywords:
    if len(keyword.split()) == 1:  # Check if the keyword is one word
        print(f"{keyword}: {score}")

But the output is this:

frameworks: 2.0
tools: 1.0
sql: 1.0
spacy: 1.0
scipy: 1.0
sci: 1.0
qualificationsexperience: 1.0
pytorch: 1.0
pyramid: 1.0
pandas: 1.0
numpy: 1.0
nosql: 1.0
nltk: 1.0
learn: 1.0
kit: 1.0
javascript: 1.0
front: 1.0
flask: 1.0
familiarity: 1.0
experience: 1.0
ecosystems: 1.0
django: 1.0
dask: 1.0
css: 1.0

I only want the explicit names of the tools, skills and frameworks used in the text, such as "Numpy", "Scipy", "HTML", etc., and NOT every single word found in it (such as "experience" or "tools").

Is there any way to do this? Or should I provide a list of all possible Python frameworks and related skills and then filter the output of RAKE? If the latter is the solution, how can I find or make a thorough list?

Any help is appreciated.

Upvotes: 0

Views: 250

Answers (2)

PromptCloud

Reputation: 1

You can use the skill and knowledge token-classification models hosted on Hugging Face through the transformers library:

from transformers import pipeline

token_skill_classifier = pipeline(model="jjzha/jobbert_skill_extraction", aggregation_strategy="first")
token_knowledge_classifier = pipeline(model="jjzha/jobbert_knowledge_extraction", aggregation_strategy="first")

def aggregate_span(results):
    # Merge entities whose character spans are separated by a single character (e.g. a space)
    new_results = []
    current_result = results[0]

    for result in results[1:]:
        if result["start"] == current_result["end"] + 1:
            current_result["word"] += " " + result["word"]
            current_result["end"] = result["end"]
        else:
            new_results.append(current_result)
            current_result = result

    new_results.append(current_result)

    return new_results

def ner(text):
    # Run both classifiers and label each hit as "Skill" or "Knowledge"
    output_skills = token_skill_classifier(text)
    for result in output_skills:
        if result.get("entity_group"):
            result["entity"] = "Skill"
            del result["entity_group"]

    output_knowledge = token_knowledge_classifier(text)
    for result in output_knowledge:
        if result.get("entity_group"):
            result["entity"] = "Knowledge"
            del result["entity_group"]

    if len(output_skills) > 0:
        output_skills = aggregate_span(output_skills)
    if len(output_knowledge) > 0:
        output_knowledge = aggregate_span(output_knowledge)

    return {"text": text, "entities": output_skills}, {"text": text, "entities": output_knowledge}
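
To see what the span-merging step does without downloading the models, here is a minimal sketch that runs the same merging logic on hand-written dictionaries. The entries in `fake_output` are made up for illustration and only mimic the `word`, `start` and `end` fields of the pipeline output; this version also copies each dictionary so the input is left untouched, which is a small change from the code above.

```python
def aggregate_span(results):
    """Merge entity dicts whose character spans are separated by exactly one character."""
    merged = []
    current = dict(results[0])  # copy so the caller's list is not mutated
    for result in results[1:]:
        if result["start"] == current["end"] + 1:
            # Adjacent span (one character apart, typically a space): extend the current one
            current["word"] += " " + result["word"]
            current["end"] = result["end"]
        else:
            merged.append(current)
            current = dict(result)
    merged.append(current)
    return merged


# Made-up entities for the text "front-end development using HTML"
fake_output = [
    {"word": "front-end", "start": 0, "end": 9},
    {"word": "development", "start": 10, "end": 21},
    {"word": "HTML", "start": 28, "end": 32},
]

spans = aggregate_span(fake_output)
print([span["word"] for span in spans])
```

The first two entries are one character apart, so they are joined into a single phrase, while "HTML" stays separate.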

Upvotes: 0

Alikbar

Reputation: 696

RAKE is a domain-independent keyword extraction algorithm, so on its own it cannot single out keywords that belong to a specific domain. The simplest solution is to filter its output against a list of known skills. You can build that list from documents similar to this one: https://gist.github.com/pvanfas/8b4518996136d1a5ffc79513b3105033

Other libraries, such as KeyBERT, may also give better results.
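
The filtering step can be sketched roughly as follows. `KNOWN_SKILLS` is a hypothetical, hand-made vocabulary (in practice it would be built from a curated list), `filter_skills` is an illustrative helper name, and the sample pairs are copied from the output in the question rather than produced by running RAKE here.

```python
# Hypothetical vocabulary for illustration; a real one would be much larger.
KNOWN_SKILLS = {
    "numpy", "scipy", "pandas", "dask", "spacy", "nltk", "pytorch",
    "html", "css", "javascript", "sql", "nosql", "django", "flask", "pyramid",
}


def filter_skills(ranked_phrases, vocabulary=KNOWN_SKILLS):
    """Keep only the (score, phrase) pairs whose phrase is in the vocabulary."""
    return [
        (score, phrase)
        for score, phrase in ranked_phrases
        if phrase.lower() in vocabulary
    ]


# A few (score, phrase) pairs in the shape returned by
# get_ranked_phrases_with_scores(), taken from the question's output.
sample = [
    (2.0, "frameworks"),
    (1.0, "tools"),
    (1.0, "sql"),
    (1.0, "spacy"),
    (1.0, "experience"),
    (1.0, "django"),
]

filtered = filter_skills(sample)
print(filtered)
```

Note that names RAKE splits into several tokens, such as "sci-kit-learn" ("sci", "kit", "learn" in the question's output), will not match a single vocabulary entry and would need separate handling.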

Upvotes: 0
