Reputation: 4122
I'm trying to identify keywords in a DataFrame column, then create new binary columns when the keywords are identified. The reproducible example below works when the individual string lists contain just a single keyword to identify; it also highlights at stage 2 where the problem lies.
The problem is that I want to add more associated keywords to each list, so that related terms are categorised into the same new column. However, when I add more than one keyword to a list I get `ValueError: Length of values does not match length of index`.
import pandas as pd

# 1. Create dataframe
test = {'comment': ['my pay review was not enough',
                    'my annual bonus was too low, I need more pay',
                    'my pay is too low', 'my bonus is huge', 'better pay please'],
        'team': ['team1', 'team2', 'team3', 'team1', 'team2']}
test = pd.DataFrame(test)
# 2. Create string lists (these are the lists I want to add multiple associated keywords to)
pay_strings = ['pay']
bonus_strings = ['bonus']
# 3. Create empty lists
pay_col = []
bonus_col = []
# 4. Loop through `comment` column to identify words and represent them in the new lists with binary values
for row in test['comment']:
    for pay in pay_strings:
        if pay in row:
            pay_col.append(1)
        elif pay not in row:
            pay_col.append(0)
    for bonus in bonus_strings:
        if bonus in row:
            bonus_col.append(1)
        elif bonus not in row:
            bonus_col.append(0)
# 5. Add new lists to dataframe
test['pay'] = pay_col
test['bonus'] = bonus_col
test
# 6. Resulting dataframe
                                        comment   team  pay  bonus
0                  my pay review was not enough  team1    1      0
1  my annual bonus was too low, I need more pay  team2    1      1
2                             my pay is too low  team3    1      0
3                              my bonus is huge  team1    0      1
4                             better pay please  team2    1      0
Is there a way to effectively look up multiple items in lists, or is there a better way to do this?
Upvotes: 1
Views: 78
Reputation: 146
As written, the inner loop appends one value to `pay_col` for every keyword on every row, so once you add more keywords the length of `pay_col` exceeds the number of rows in your dataframe, which causes the referenced error.
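To see the mismatch concretely, here is a small sketch that counts the appended values. The second keyword `'salary'` is a hypothetical addition used only to trigger the problem; it does not come from the question.

```python
import pandas as pd

test = pd.DataFrame({'comment': ['my pay review was not enough',
                                 'my annual bonus was too low, I need more pay',
                                 'my pay is too low', 'my bonus is huge',
                                 'better pay please']})

# 'salary' is a made-up second keyword, added only for illustration
pay_strings = ['pay', 'salary']

pay_col = []
for row in test['comment']:
    for pay in pay_strings:
        # One value is appended per keyword, not per row
        pay_col.append(1 if pay in row else 0)

n_rows = len(test)
n_values = len(pay_col)
print(n_rows, n_values)  # 5 rows, but 10 values (5 rows x 2 keywords)
```

Assigning `pay_col` to a new column at this point would raise the `ValueError` from the question, because the list is twice as long as the index.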
Modify this code block:
for row in test['comment']:
    for pay in pay_strings:
        if pay in row:
            pay_col.append(1)
        elif pay not in row:
            pay_col.append(0)
to either maintain a separate result for each keyword (in which case you will have one column for each keyword in your `pay_strings` list) or append a single value per row (i.e., per comment) that is set to 1 once any keyword in the list has matched.
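A minimal sketch of the second option (one value per row), assuming a comment should be flagged when it contains any keyword from the list. The extra keywords `'salary'` and `'incentive'` and the helper name `flag_any` are hypothetical, and plain substring matching is carried over from the `in` check in the question.

```python
import pandas as pd

test = pd.DataFrame({
    'comment': ['my pay review was not enough',
                'my annual bonus was too low, I need more pay',
                'my pay is too low', 'my bonus is huge', 'better pay please'],
    'team': ['team1', 'team2', 'team3', 'team1', 'team2']})

# Hypothetical keyword lists with more than one entry each
pay_strings = ['pay', 'salary']
bonus_strings = ['bonus', 'incentive']


def flag_any(comment, keywords):
    """Return 1 if any keyword occurs as a substring of comment, else 0."""
    return int(any(keyword in comment for keyword in keywords))


# Exactly one value per row, so the lengths always match the index
test['pay'] = [flag_any(row, pay_strings) for row in test['comment']]
test['bonus'] = [flag_any(row, bonus_strings) for row in test['comment']]
print(test)
```

Because `any` collapses all keywords into a single result per comment, the new columns stay the same length as the dataframe however many keywords are added. For larger frames, `Series.str.contains` with a regex alternation such as `'pay|salary'` may be a vectorised alternative worth trying.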
Upvotes: 1