Reputation: 67
I have a dataframe of text job descriptions and three empty level columns:
index job_description level_1 level_2 level_3
0 this job requires masters in.. 0 0 0
1 bachelor degree needed for.. 0 0 0
2 ms is preferred or phd.. 0 0 0
I'm trying to go through each job description string and count the frequency of each degree level that was mentioned in the job description. A sample output should look like this.
index job_description level_1 level_2 level_3
0 this job requires masters in.. 0 1 0
1 bachelor degree needed for.. 1 0 0
2 ms is preferred or phd.. 0 1 1
I created the dictionaries to do the comparison as seen below, but I'm somewhat clueless on how I can look for those words in the strings of the dataframe "job description" column and populate the dataframe columns depending on whether the words exist or not.
my_dict_1 = dict.fromkeys(['bachelors', 'bachelor', 'ba', 'science degree', 'bs', 'engineering degree'], 1)
my_dict_2 = dict.fromkeys(['masters', 'ms', 'master'], 1)
my_dict_3 = dict.fromkeys(['phd', 'p.h.d'], 1)
I really appreciate the support on this..
Upvotes: 3
Views: 1121
Reputation: 21729
I think we can do it like this:
# create a level based mapper dict
mapper = {'level_1':['bachelors', 'bachelor', 'ba','science degree','bs','engineering degree'],
'level_2': ['masters', 'ms', 'master'],
'level_3': ['phd','p.h.d']}
# convert each list to a set for faster membership checks
mapper = {k: set(v) for k, v in mapper.items()}
# remove dots from the description (regex=False so '.' is treated literally)
df['description'] = df['description'].str.replace('.', '', regex=False)
# check if any word of description is available in the mapper dict
df['flag'] = df['description'].str.split(' ').apply(set).apply(lambda x: [k for k,v in mapper.items() if any([y for y in x if y in v])])
# explode the list of matched levels into rows (the dropped 'level_2' here is the
# auto-generated stack level name, not a data column)
df1 = df.set_index(['index','description'])['flag'].apply(pd.Series).stack().reset_index().drop('level_2', axis=1)
df1.rename(columns={0:'flag'}, inplace=True)
# add a value column; it will be used as the filler
df1['val'] = 1
# convert the data into wide format
df1 = df1.set_index(['index','description','flag'])['val'].unstack(fill_value=0).reset_index()
df1.columns.name = None
print(df1)
   index                   description  level_1  level_2  level_3
0      0  this job requires masters in        0        1        0
1      1    bachelor degree needed for        1        0        0
2      2        ms is preferred or phd        0        1        1
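The same wide result can also be produced more directly, without the stack/unstack round trip, by flagging each level column in a loop. This is a minimal sketch assuming the same mapper dict; note that per-word matching means multi-word phrases like 'science degree' won't match a whitespace split (the same limitation as the set-based steps above):

```python
import pandas as pd

df = pd.DataFrame({
    "index": [0, 1, 2],
    "description": ["this job requires masters in",
                    "bachelor degree needed for",
                    "ms is preferred or phd"],
})

mapper = {"level_1": ["bachelors", "bachelor", "ba", "science degree", "bs", "engineering degree"],
          "level_2": ["masters", "ms", "master"],
          "level_3": ["phd", "p.h.d"]}

# split each description into a set of words once
words = df["description"].str.split().apply(set)

# for each level, flag rows whose word set intersects the keyword set
for level, keywords in mapper.items():
    df[level] = words.apply(lambda s: int(bool(s & set(keywords))))

print(df)
```

This avoids building the intermediate long-format frame entirely, at the cost of one pass per level.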
Upvotes: 1
Reputation: 11602
A slightly different approach is to store keywords and job descriptions as sets, and then compute set intersections. You could generate the intersection matrix compactly by vectorizing set.intersection:
import numpy as np
import pandas as pd
from io import StringIO

df = pd.read_csv(
    StringIO(
        """ index job_description level_1 level_2 level_3
0 this job requires masters in.. 0 0 0
1 bachelor degree needed for.. 0 0 0
2 ms is preferred or phd .. 0 0 0"""
    ),
    sep=r" +",
)
levels = np.array(
    [
        {"bachelors", "bachelor", "ba", "science degree", "bs", "engineering degree"},
        {"masters", "ms", "master"},
        {"phd", "p.h.d"},
    ],
    dtype=object,
)
df[["level_1", "level_2", "level_3"]] = (
    np.vectorize(set.intersection)(
        df.job_description.str.split().apply(set).values[:, None], levels
    )
    .astype(bool)
    .astype(int)
)
index job_description level_1 level_2 level_3
0 0 this job requires masters in.. 0 1 0
1 1 bachelor degree needed for.. 1 0 0
2 2 ms is preferred or phd .. 0 1 1
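The trick here is broadcasting: a column vector of per-row word sets against a row vector of per-level keyword sets yields the full intersection matrix in one call, and non-empty intersections become 1 via the bool/int casts. A small standalone sketch (with made-up two-row, two-level data) to illustrate:

```python
import numpy as np

# per-row word sets and per-level keyword sets as object arrays
row_sets = np.array([{"ms", "is", "preferred"}, {"bachelor", "degree"}], dtype=object)
levels = np.array([{"bachelor", "ba"}, {"ms", "master"}], dtype=object)

# (2, 1) column of row sets against (2,) row of level sets -> (2, 2) intersections
inter = np.vectorize(set.intersection)(row_sets[:, None], levels)

# non-empty set -> True -> 1
flags = inter.astype(bool).astype(int)
print(flags)
```

Each cell of `inter` is the set of keywords from that level found in that row, so the final matrix slots straight into the level columns.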
Upvotes: 2
Reputation: 12515
How about something like this?
Since each of your three dictionaries corresponds to a different column you want to create, we can build another dictionary mapping the soon-to-be column names as keys to the strings to search for at each particular level as values (really, you don't even need a dictionary for storing the my_dict_<x> items - you could use a set instead - but it's not a huge deal):
>>> lookup = {'level_1': my_dict_1, 'level_2': my_dict_2, 'level_3': my_dict_3}
>>> lookup
{'level_1': {'bachelors': 1, 'bachelor': 1, 'ba': 1, 'science degree': 1, 'bs': 1, 'engineering degree': 1}, 'level_2': {'masters': 1, 'ms': 1, 'master': 1}, 'level_3': {'phd': 1, 'p.h.d': 1}}
Then, go through each proposed column in the dictionary you just created and assign a new column with the output you want, checking for each level specified in each my_dict_<x> object whether at least one keyword appears in the job description in each row:
>>> for level, values in lookup.items():
... df[level] = df['job_description'].apply(lambda x: 1 if any(v in x for v in values) else 0)
...
>>> df
job_description level_1 level_2 level_3
0 masters degree required 0 1 0
1 bachelor's degree required 1 0 0
2 bachelor degree required 1 0 0
3 phd required 0 0 1
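One caveat with `v in x` above is that it is plain substring matching, so a short keyword like 'ba' would also fire inside an unrelated word such as 'probation'. A regex with word boundaries avoids that; here is a sketch (with hypothetical sample rows) using `str.contains`:

```python
import re
import pandas as pd

df = pd.DataFrame({"job_description": ["masters degree required",
                                       "probation period applies",
                                       "phd required"]})

lookup = {"level_1": ["bachelors", "bachelor", "ba", "bs"],
          "level_2": ["masters", "ms", "master"],
          "level_3": ["phd"]}

for level, values in lookup.items():
    # \b anchors keep 'ba' from matching inside 'probation'
    pattern = r"\b(?:" + "|".join(map(re.escape, values)) + r")\b"
    df[level] = df["job_description"].str.contains(pattern, case=False).astype(int)

print(df)
```

`re.escape` also keeps dotted keywords like 'p.h.d' literal instead of letting '.' match any character.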
Another solution, using scikit-learn's CountVectorizer class, which counts the frequencies of tokens (words, basically) occurring in strings:
>>> from sklearn.feature_extraction.text import CountVectorizer
Specify a particular vocabulary - forget about all other words that aren't "academic credential" keywords:
>>> vec = CountVectorizer(vocabulary={value for level, values in lookup.items() for value in values})
>>> vec.vocabulary
{'master', 'p.h.d', 'ba', 'ms', 'engineering degree', 'masters', 'phd', 'bachelor', 'bachelors', 'bs', 'science degree'}
Fit that transformer to the text iterable, df['job_description']:
>>> result = vec.fit_transform(df['job_description'])
Taking a deeper look at the results:
>>> pd.DataFrame(result.toarray(), columns=vec.get_feature_names())
ba bachelor bachelors bs engineering degree master masters ms p.h.d phd science degree
0 0 0 0 0 0 0 1 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 0
This last approach might require a bit more work if you want to get back to your level_<x> column structure, but I thought I'd just show it as a different way of thinking about encoding those datapoints.
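That extra work amounts to grouping the token-count columns by level and collapsing each group to a 0/1 flag. A sketch of one way to do it (assuming the lookup dict from above and a small hypothetical set of descriptions; `get_feature_names_out` is the accessor in recent scikit-learn versions):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lookup = {"level_1": ["bachelor", "bachelors", "ba", "bs"],
          "level_2": ["masters", "ms", "master"],
          "level_3": ["phd"]}

docs = ["masters degree required", "bachelor degree required", "phd required"]

# pass the vocabulary as a list so the column order is deterministic
vocab = [v for vals in lookup.values() for v in vals]
vec = CountVectorizer(vocabulary=vocab)
counts = pd.DataFrame(vec.fit_transform(docs).toarray(), columns=vocab)

# collapse token counts to one 0/1 flag per level (row-wise max, capped at 1)
flags = pd.DataFrame({level: counts[vals].max(axis=1).clip(upper=1)
                      for level, vals in lookup.items()})
print(flags)
```

If you want mention counts rather than flags, swap the max/clip for a row-wise sum.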
Upvotes: 2