Reputation: 585
I am working with a dataframe of that kind.
num person cash
0 1 personone 29
1 1 persontwo 81
2 1 personone: 17
3 1 personone 75
4 1 personone and persontwo 62
5 1 personone's friend 55
I am using the following code to filter a dataframe based on vector of strings.
people = ["personone", "persontwo"]
pattern = r"\b{}\b".format("|".join(people))
df[df["person"].str.match(pattern)]
The problem is that the code produces a bit more than I need. When I check for df["person"].unique()
it gives me an array of:
array(["personone", "persontwo", "personone:", "personone ", "personone and persontwo", "personone's friend"], dtype=object)
Even though I used the word boundaries \b
.
The outcome I would like to achieve is: combine personone
, personone:
and personone
(the last one with the space at the end). And produce the output of:
num person cash
0 1 personone 121
1 1 persontwo 81
I.e. combine the three variants of personone
and disregard its all other appearances. The cash
for personone
is the sum of 29+17+75=121.
Upvotes: 1
Views: 63
Reputation: 15872
You can use ^
and $
to limit the matches:
>>> people = ["personone", "persontwo"]
>>> patt = fr"^({'|'.join(people)}).?$"
>>> (
df.groupby(df.person.str.extract(patt, expand=False))
.agg(cash=('cash', 'sum'), num=('num', 'first'))
.reset_index().reindex(df.columns, axis=1)
)
num person cash
0 1 personone 121
1 1 persontwo 81
Upvotes: 0
Reputation: 19405
what you are asking is not entirely clear. For instance, why are you dropping the case personone and persontwo
?
Anyway, one way to proceed is to create a flag
variable that flags the good observations, such as
df['flag'] = df.person.str.contains('personone:?$')
and then you simply sum
df.loc[df.flag == True, 'cash'].sum()
Upvotes: 0
Reputation: 150815
One option is to match with optional \W
instead of \b
and force begin and end string:
people = ["personone", "persontwo"]
pattern = r"^\W?({})\W?$".format("|".join(people))
s = df["person"].str.extract(pattern,expand=False)
df[s.notna()].groupby(['num',s])['cash'].sum()
Output:
num person
1 personone 121
persontwo 81
Name: cash, dtype: int64
Upvotes: 1