Reputation: 117
I have two dataframes:
the first is called Roster:
| u_id | Skills |
|------|------------------------------------------------|
| 1 | ai, deep learning, machine learning, nlp |
| 2 | computer vision, statistics, python, css |
| 3 | development, keras, numpy, supervised learning |
the second is called Taxonomy:
| Skill_ID | Skill_Name |
|---------------------------|------------|
| AI | some |
| Computer Engineering | other |
| Machine Learning | data |
| Statistics | here |
| Robotics | blah |
| Exploratory Data Analysis | blah |
I need to look inside Roster["Skills"] for any Skill_ID from Taxonomy["Skill_ID"]. There will likely be several matches, so to handle that I want to aggregate all matches for a row in a list within an adjacent cell of the Roster DataFrame.
I started by tokenizing Roster["Skills"] but then realized I would miss all compound terms such as "computer engineering". I suppose I could lower-case all the data and then write a plain search without any other NLP, but I am having trouble with the code.
Latest attempt:
all_skills = []
for row in roster_skills:
    for skill in taxonomy_skill_id:
        if skill in row:
            all_skills.append(skill)
skills_list_len = range(len(all_skills))
for n in skills_list_len:
    roster1[n] = all_skills
ValueError: Length of values does not match length of index
Upvotes: 0
Views: 75
Reputation: 16683
I created cleaner dataframes for Roster and Taxonomy so that they can be run easily. Then I created a column of empty lists called all_skills. From there, start the loop with zip() so that you can iterate through multiple columns simultaneously: i is the value from Roster['u_id'] and j is the value from Roster['Skills']. Subtracting 1 from i gives the row label, which is how the row is addressed in Roster.loc[i-1,'all_skills'].append(k). This relies on u_id running 1, 2, 3, ... while the index runs 0, 1, 2, ...
Use .append() to add k, i.e. skill.lower() for each skill in Taxonomy['Skill_ID'], whenever k is found in j. You have to use .loc to locate the row and column that you need to append to.
import pandas as pd

Roster = pd.DataFrame({'u_id': {0: 1, 1: 2, 2: 3},
                       'Skills': {0: 'ai, deep learning, machine learning, nlp',
                                  1: 'computer vision, statistics, python, css',
                                  2: 'development, keras, numpy, supervised learning'}})
Taxonomy = pd.DataFrame({'Skill_ID': {0: 'AI',
                                      1: 'Computer Engineering',
                                      2: 'Machine Learning',
                                      3: 'Statistics',
                                      4: 'Robotics',
                                      5: 'Exploratory Data Analysis'},
                         'Skill_Name': {0: 'some',
                                        1: 'other',
                                        2: 'data',
                                        3: 'here',
                                        4: 'blah',
                                        5: 'blah'}})

# Give every row its own empty list: list('') evaluates to [].
Roster['all_skills'] = ''
Roster['all_skills'] = Roster['all_skills'].apply(list)

# i is the u_id, j is the comma-separated skills string of that row.
for i, j in zip(Roster['u_id'], Roster['Skills']):
    for skill in Taxonomy['Skill_ID']:
        k = skill.lower()
        if k in j:  # substring test against the lower-case skills string
            print(k)
            # u_id starts at 1 and the index at 0, hence i-1.
            Roster.loc[i-1,'all_skills'].append(k)
Output:
   u_id                                          Skills              all_skills
0     1        ai, deep learning, machine learning, nlp  [ai, machine learning]
1     2        computer vision, statistics, python, css            [statistics]
2     3  development, keras, numpy, supervised learning                      []
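One caveat with the loop above: k in j is a substring test, so a short ID such as "ai" would also match inside a longer word like "maintenance", and the i-1 lookup only works while u_id stays exactly one ahead of the index. As a minimal sketch of an alternative (not part of the original answer), the list can be built per row with apply, comparing whole comma-separated phrases instead. The helper name match_skills is hypothetical, and the frames are the ones defined above, reduced to the columns that matter here.

```python
import pandas as pd

Roster = pd.DataFrame({
    'u_id': [1, 2, 3],
    'Skills': ['ai, deep learning, machine learning, nlp',
               'computer vision, statistics, python, css',
               'development, keras, numpy, supervised learning'],
})
Taxonomy = pd.DataFrame({
    'Skill_ID': ['AI', 'Computer Engineering', 'Machine Learning',
                 'Statistics', 'Robotics', 'Exploratory Data Analysis'],
})

# Lower-case the taxonomy once so the comparison is case-insensitive.
taxonomy_ids = [s.lower() for s in Taxonomy['Skill_ID']]


def match_skills(skills):
    """Hypothetical helper: taxonomy IDs that occur as whole phrases in `skills`."""
    # Split on commas so compound terms such as "machine learning" stay intact.
    phrases = {p.strip().lower() for p in skills.split(',')}
    # Iterate over taxonomy_ids so the result keeps the taxonomy's order.
    return [t for t in taxonomy_ids if t in phrases]


# apply() returns one list per row, so no index arithmetic is needed.
Roster['all_skills'] = Roster['Skills'].apply(match_skills)
print(Roster[['u_id', 'all_skills']])
```

If substring matching really is what you want, replacing the membership test with t in skills.lower() should reproduce the behaviour of the loop above.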
Upvotes: 1