Reputation: 31

How to use pandas string contains (str.contain) in a loop?

I have a pandas df with a column A, which is a string of strings. Each item in the series (i.e. each row in the database) is just one long string, separated by commas. I would like to create a new column called B that incrementally counts everytime an object from a separate list appears in each row of column A. For instance:

my list looks like this:

list = ('dog', 'bird', 'cat')

my dataframe looks like this:

A                           B
dog, bird                   2
cat, bird                   2
dog, snake                  1
cat, bird, snake            2
dog, bird, cat, snake       3
dog, bird cat               3

i'm trying to create a nested loop that does the following: start at df.a[0] (i.e. the first value of df.A), find out if it contains the first value of the list (i.e. 'dog'). If df.A[0] contains, then add 1 to B. then, staying in the same row of df.A, move on to the second value of the list (i.e. 'bird'). If that df.A[0] also contains this value, then add another 1 to B. etc etc.

This is the code I'm trying to use.

for i in df['A']:
    for j in list:
        if i.str.contains(j):
            df['B'] += 1

However, I keep getting the error:

'str' object has no attribute 'str'

How can I tell pandas to look at the entire series, while also telling it to function as a loop with the structure identified above? Or alternatively, what is the best way to solve this problem?

Upvotes: 3

Answers (2)

BENY

Reputation: 323226

Another way to achieve what you need involved get_dummies

df.A.str.get_dummies(', ').loc[:,lst].sum(1)
Out[849]: 
0    2
1    2
2    1
3    2
4    3
5    3
dtype: int64

Or we just do str.split with isin

df.A.str.split(', ',expand=True).isin(lst).sum(1)
Out[853]: 
0    2
1    2
2    1
3    2
4    3
5    3
dtype: int64

Upvotes: 4

cs95

Reputation: 402483

A couple of notes -

Don't iterate over a DataFrame when you can avoid it. Always look to vectorize. If you cannot, only then use a list comprehension
When iterating over the column, you're iterating over individual string items. Those do not have an .str attribute.
Don't use list and other similar names (dict, tuple) to name variables/objects, they shadow the builtin. I've renamed your variable to substr below.

A KISS solution would involve str.findall + str.len. No splitting required.

substr = ('dog', 'bird', 'cat')
df['B'] = df['A'].str.findall('|'.join(substr)).str.len()

df['B']

0    2
1    2
2    1
3    2
4    3
5    3
Name: A, dtype: int64

If you have large strings and a lot of substrings, you may want to look at using the Aho-Corasick algorithm.

Upvotes: 6

How to use pandas string contains (str.contain) in a loop?

Answers (2)

Related Questions