Matan

Reputation: 117

Find (space separated) compound words in a DataFrame

I have two dataframes:

the first is called Roster:

| u_id | Skills                                         |
|------|------------------------------------------------|
| 1    | ai, deep learning, machine learning, nlp       |
| 2    | computer vision, statistics, python, css       |
| 3    | development, keras, numpy, supervised learning |

the second is called Taxonomy:

| Skill_ID                  | Skill_Name |
|---------------------------|------------|
| AI                        | some       |
| Computer Engineering      | other      |
| Machine Learning          | data       |
| Statistics                | here       |
| Robotics                  | blah       |
| Exploratory Data Analysis | blah       |

I need to look inside Roster["Skills"] for any Skill_ID from Taxonomy["Skill_ID"]. There will likely be several matches, so I want to aggregate all matches for a row into a list in an adjacent cell of the Roster DataFrame.

I started by tokenizing Roster["Skills"] but then realized I would miss all compound words such as "computer engineering". I suppose I could lowercase all the data and then write a plain search without any other NLP, but I am having trouble with the code.

Latest attempt:

all_skills = []

for row in roster_skills:
    for skill in taxonomy_skill_id:
        if skill in row:
            all_skills.append(skill)
            skills_list_len = range(len(all_skills))
            for n in skills_list_len:
                roster1[n] = all_skills

ValueError: Length of values does not match length of index

Upvotes: 0

Views: 75

Answers (1)

David Erickson

Reputation: 16683

I recreated Roster and Taxonomy as dataframes that can be run directly, then added a column called all_skills that holds an empty list in each row. From there, start your loop with zip() so that you iterate over Roster['u_id'] (as i) and Roster['Skills'] (as j) at the same time. For each skill in Taxonomy['Skill_ID'], k is skill.lower(); if k is in j, append it with Roster.loc[i-1,'all_skills'].append(k). Subtracting 1 from u_id gives the row label, which works here because u_id starts at 1 while the default index starts at 0. You have to use .loc to locate the row and column that you need to append to.

import pandas as pd

Roster = pd.DataFrame({'u_id': {0: 1, 1: 2, 2: 3},
 'Skills': {0: 'ai, deep learning, machine learning, nlp',
  1: 'computer vision, statistics, python, css',
  2: 'development, keras, numpy, supervised learning'}})

Taxonomy = pd.DataFrame({'Skill_ID': {0: 'AI',
  1: 'Computer Engineering',
  2: 'Machine Learning',
  3: 'Statistics',
  4: 'Robotics',
  5: 'Exploratory Data Analysis'},
 'Skill_Name': {0: 'some',
  1: 'other',
  2: 'data',
  3: 'here',
  4: 'blah',
  5: 'blah'}})

Roster['all_skills'] = ''
Roster['all_skills'] = Roster['all_skills'].apply(list)

for i, j in zip(Roster['u_id'], Roster['Skills']):
    for skill in Taxonomy['Skill_ID']:
        k = skill.lower()
        if k in j:
            print(k)
            Roster.loc[i-1,'all_skills'].append(k)      

Output:

    u_id    Skills                                          all_skills
0   1       ai, deep learning, machine learning, nlp        [ai, machine learning]
1   2       computer vision, statistics, python, css        [statistics]
2   3       development, keras, numpy, supervised learning  []
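
One caveat with the substring test `k in j`: a short name such as "ai" would also match inside an unrelated word (for example "training"). A minimal sketch of an alternative, assuming every Skills cell is a comma-separated list, is to split each cell on commas and compare whole entries against the lowercased taxonomy. The names `known_skills` and `match_skills` are illustrative and not part of the code above; it also avoids relying on u_id lining up with the index.

```python
import pandas as pd

roster = pd.DataFrame({
    "u_id": [1, 2, 3],
    "Skills": [
        "ai, deep learning, machine learning, nlp",
        "computer vision, statistics, python, css",
        "development, keras, numpy, supervised learning",
    ],
})
taxonomy = pd.DataFrame({
    "Skill_ID": ["AI", "Computer Engineering", "Machine Learning",
                 "Statistics", "Robotics", "Exploratory Data Analysis"],
})

# Lowercased taxonomy entries, kept in a set for fast membership tests.
known_skills = {s.strip().lower() for s in taxonomy["Skill_ID"]}


def match_skills(cell):
    """Return the comma-separated entries of `cell` that are in the taxonomy."""
    # Splitting on commas keeps compound skills such as "machine learning" intact.
    parts = [p.strip().lower() for p in cell.split(",")]
    return [p for p in parts if p in known_skills]


# apply() builds one list per row, so no index arithmetic is needed.
roster["all_skills"] = roster["Skills"].apply(match_skills)
print(roster)
```

Because whole entries are compared, a roster entry only matches when it equals a taxonomy name exactly (after lowercasing and trimming), so partial overlaps such as "computer vision" versus "Computer Engineering" are not reported.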

Upvotes: 1
