Alejandro Simkievich

Reputation: 3792

python - pass dataframe column as argument in apply function

I have the following dataframe:

In[1]: from pandas import DataFrame
       df = DataFrame({"A": ['I love cooking','I love rowing'], "B": [['cooking','rowing'],['cooking','rowing']]})

Displaying it gives:

In[2]: df
Out[2]:
                A                  B
0  I love cooking  [cooking, rowing]
1   I love rowing  [cooking, rowing]

I want to create a 'C' column where I count the number of occurrences of elements of 'B' in 'A'.

The function I create is:

def count_keywords(x,y):
    a = 0
    for element in y:
        if element in x:
            a += 1
    return a

and then do:

df['A'].apply(count_keywords,args=(df['B'],))

In this case I am passing the entire pandas Series df['B'] as the argument, so each element the loop sees is a list (one row's keywords), not a string from inside that list.

So I get:

TypeError: 'in <string>' requires string as left operand, not list
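
The same error can be reproduced without pandas. This is a minimal sketch with illustrative variable names (text, keywords, error_message are not from the code above); it only shows that the `in` operator on a string rejects a list as its left operand:

```python
text = "I love cooking"
keywords = ["cooking", "rowing"]

# A string on the left-hand side is a plain substring test.
print("cooking" in text)

# A list on the left-hand side is rejected by str.__contains__.
try:
    keywords in text
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```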

However, if I adjust the function so that:

def count_keywords(x,y): 
    a = 0
    for element in y:
        for new_element in element:
            if new_element in x:
                a += 1
    return a

and then do:

In[3]: df['A'].apply(count_keywords,args=(df['B'],))

the output is:

Out[3]:
0    2
1    2

This happens because the function loops over every row of df['B'] and then over every element of each list, so each string in 'A' is checked against the keywords of all rows.

How can I get the function to check, row by row, only that row's list in df['B'] against that row's string in df['A'], so that the output is:

Out[3]:
0    1
1    1

Thanks a lot!

Upvotes: 0

Views: 7184

Answers (2)

maxymoo

Reputation: 36545

Another way you could do this is by using a set intersection and taking its size. In theory this may be faster than iterating over the elements, since sets are designed for membership tests. Note that this matches whole whitespace-separated words rather than substrings:

df['C'] = df.apply(lambda x: len(set(x.B).intersection(set(x.A.split()))), axis = 1)
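
As a rough sketch of how this behaves, the snippet below rebuilds the dataframe from the question and applies the same idea through a named helper. The helper name count_whole_words and the punctuation example are assumptions added for illustration. The second part shows where set intersection and the substring test from the question can disagree:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["I love cooking", "I love rowing"],
    "B": [["cooking", "rowing"], ["cooking", "rowing"]],
})

def count_whole_words(row):
    # Hypothetical helper: counts keywords in row["B"] that appear
    # as whole whitespace-separated words in row["A"].
    words = set(row["A"].split())
    return len(words.intersection(row["B"]))

df["C"] = df.apply(count_whole_words, axis=1)
print(df["C"].tolist())

# Splitting on whitespace keeps punctuation attached to the word,
# so "cooking." is not equal to "cooking".
sentence = "I love cooking."
whole_word_hits = len(set(sentence.split()).intersection(["cooking"]))
substring_hit = "cooking" in sentence
print(whole_word_hits, substring_hit)
```

If the text contains punctuation or the keywords can appear inside longer words, the two approaches may give different counts, so it is worth deciding which behavior is wanted.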

Upvotes: 2

vmg

Reputation: 4326

You have to apply over the other axis (axis=1), so that the function receives one row at a time.

def count_keywords(row): 
    counter = 0
    for e in row['B']:
        if e in row['A']:
            counter += 1
    row['C'] = counter
    return row

df2 = df.apply(count_keywords,axis=1)

Gives you:

                A                  B  C
0  I love cooking  [cooking, rowing]  1
1   I love rowing  [cooking, rowing]  1

Then df2['C'] should give you the 1,1 series you mention.
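
A possible alternative, sketched here under the assumption that only the new column is needed, is to pair the two columns with zip and keep the two-argument function from the question. The generator-based body of count_keywords is a rewrite for illustration, not the original code:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["I love cooking", "I love rowing"],
    "B": [["cooking", "rowing"], ["cooking", "rowing"]],
})

def count_keywords(text, keywords):
    # Count how many keywords occur as substrings of text.
    return sum(1 for keyword in keywords if keyword in text)

# zip pairs each string in "A" with the list in the same row of "B".
df["C"] = [count_keywords(text, keywords)
           for text, keywords in zip(df["A"], df["B"])]
print(df["C"].tolist())
```

This avoids building a Series for every row, which is what apply with axis=1 does, and it assigns the result directly to df['C'] instead of creating a second dataframe.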

Upvotes: 2
