How to identify more than one list item in a for loop

Question

I'm trying to identify keywords in a DataFrame column, then create new binary columns when the keywords are identified. The reproducible example below works when the individual string lists contain just a single keyword to identify; it also highlights at stage 2 where the problem lies.

The problem is I want to add more associated keywords to each list, so that associated terms can effectively be categorised into the new columns. However, when I add more than one keyword to the lists I get ValueError: Length of values does not match length of index.

import pandas as pd

# 1. Create dataframe
test = {'comment': ['my pay review was not enough',
                    'my annual bonus was too low, I need more pay',
                    'my pay is too low', 'my bonus is huge', 'better pay please'],
        'team': ['team1', 'team2', 'team3', 'team1', 'team2']}

test = pd.DataFrame(test)

# 2. Create string lists (these are the lists I want to add multiple associated keywords to)
pay_strings = ['pay']
bonus_strings = ['bonus']

# 3. Create empty lists
pay_col = []
bonus_col = []

# 4. Loop through `comment` column to identify words and represent them in the new lists with binary values

for row in test['comment']:
    for pay in pay_strings:
        if pay in row:
            pay_col.append(1)
        elif pay not in row:
            pay_col.append(0)

    for bonus in bonus_strings:
        if bonus in row:
            bonus_col.append(1)
        elif bonus not in row: 
            bonus_col.append(0)          

# 5. Add new lists to dataframe

test['pay'] = pay_col
test['bonus'] = bonus_col
test

# 6. Resulting dataframe
    comment                                       team    pay   bonus
0   my pay review was not enough                  team1   1     0
1   my annual bonus was too low, I need more pay  team2   1     1
2   my pay is too low                             team3   1     0
3   my bonus is huge                              team1   0     1
4   better pay please                             team2   1     0

Is there a way to effectively look up multiple items in lists, or is there a better way to do this?

M Hart · Accepted Answer

As written, the inner loop appends one value to pay_col for every keyword in pay_strings. Once a list holds more than one keyword, pay_col ends up longer than the number of rows in your DataFrame, which causes the referenced ValueError.
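The length mismatch can be reproduced with a minimal sketch. The two-row frame and the extra keyword 'salary' are assumptions made for illustration; they are not part of the original question.

```python
import pandas as pd

# Shortened, hypothetical version of the question's data
test = pd.DataFrame({'comment': ['my pay review was not enough',
                                 'my bonus is huge']})
pay_strings = ['pay', 'salary']  # 'salary' is an assumed extra keyword

pay_col = []
for row in test['comment']:
    for pay in pay_strings:
        # One append per (row, keyword) pair, not one per row
        pay_col.append(1 if pay in row else 0)

print(len(test), len(pay_col))
```

With two rows and two keywords the list holds four values, so assigning it to a new column fails because pandas expects exactly one value per row.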

Modify this code block:

for row in test['comment']:
    for pay in pay_strings:
        if pay in row:
            pay_col.append(1)
        elif pay not in row:
            pay_col.append(0)

to either keep a separate value for each keyword (in which case you will have one column per keyword in your pay_strings list) or record a single value per row (i.e., comment) that is set to 1 as soon as any keyword in the list matches.
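Both options might look something like the sketch below. The helper name flag_any, the kw_ column prefix, and the extra keywords 'salary' and 'incentive' are assumptions introduced here for illustration. A vectorised alternative using Series.str.contains is included for comparison.

```python
import re

import pandas as pd

test = pd.DataFrame({
    'comment': ['my pay review was not enough',
                'my annual bonus was too low, I need more pay',
                'my pay is too low', 'my bonus is huge', 'better pay please'],
    'team': ['team1', 'team2', 'team3', 'team1', 'team2']})

# 'salary' and 'incentive' are assumed extra keywords
pay_strings = ['pay', 'salary']
bonus_strings = ['bonus', 'incentive']


def flag_any(comments, keywords):
    """Return one 0/1 value per comment: 1 if any keyword is a substring."""
    return [int(any(kw in text for kw in keywords)) for text in comments]


# Option A: one column per category, one value per row
test['pay'] = flag_any(test['comment'], pay_strings)
test['bonus'] = flag_any(test['comment'], bonus_strings)

# Option B: one column per individual keyword
for kw in pay_strings + bonus_strings:
    test['kw_' + kw] = [int(kw in text) for text in test['comment']]

# Vectorised alternative to Option A: join the escaped keywords into a regex
pay_pattern = '|'.join(re.escape(kw) for kw in pay_strings)
pay_vectorised = test['comment'].str.contains(pay_pattern).astype(int)
```

All three variants produce exactly one value per row, so the assignment no longer raises. Note that these are substring matches, so 'pay' would also match a word such as 'payment'; whether that is acceptable depends on your data.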
