Reputation: 31
I have a pandas df with a column A, which is a string of strings. Each item in the series (i.e. each row in the database) is just one long string, separated by commas. I would like to create a new column called B that incrementally counts everytime an object from a separate list appears in each row of column A. For instance:
my list looks like this:
list = ('dog', 'bird', 'cat')
my dataframe looks like this:
A B
dog, bird 2
cat, bird 2
dog, snake 1
cat, bird, snake 2
dog, bird, cat, snake 3
dog, bird cat 3
i'm trying to create a nested loop that does the following: start at df.a[0] (i.e. the first value of df.A), find out if it contains the first value of the list (i.e. 'dog'). If df.A[0] contains, then add 1 to B. then, staying in the same row of df.A, move on to the second value of the list (i.e. 'bird'). If that df.A[0] also contains this value, then add another 1 to B. etc etc.
This is the code I'm trying to use.
for i in df['A']:
for j in list:
if i.str.contains(j):
df['B'] += 1
However, I keep getting the error:
'str' object has no attribute 'str'
How can I tell pandas to look at the entire series, while also telling it to function as a loop with the structure identified above? Or alternatively, what is the best way to solve this problem?
Upvotes: 3
Views: 8219
Reputation: 323226
Another way to achieve what you need involved get_dummies
df.A.str.get_dummies(', ').loc[:,lst].sum(1)
Out[849]:
0 2
1 2
2 1
3 2
4 3
5 3
dtype: int64
Or we just do str.split
with isin
df.A.str.split(', ',expand=True).isin(lst).sum(1)
Out[853]:
0 2
1 2
2 1
3 2
4 3
5 3
dtype: int64
Upvotes: 4
Reputation: 402483
A couple of notes -
.str
attribute.list
and other similar names (dict
, tuple
) to name variables/objects, they shadow the builtin. I've renamed your variable to substr
below.A KISS solution would involve str.findall
+ str.len
. No splitting required.
substr = ('dog', 'bird', 'cat')
df['B'] = df['A'].str.findall('|'.join(substr)).str.len()
df['B']
0 2
1 2
2 1
3 2
4 3
5 3
Name: A, dtype: int64
If you have large strings and a lot of substrings, you may want to look at using the Aho-Corasick algorithm.
Upvotes: 6