Reputation: 784
I have the following (example) dataset:
>>> pd.DataFrame([["001", "Apple"],["002","Strawberry"],["001", None],["002","Strawberry"], ["003", "Apple"],["003","Strawberry"],], columns = ["Deal", "Product"])
Deal Product
0 001 Apple
1 002 Strawberry
2 001
3 002 Strawberry
4 003 Apple
5 003 Strawberry
I would like to group in a set all the products linked to an order, like this:
Deal Product
0 001 {Apple}
1 002 {Strawberry}
2 003 {Strawberry, Apple}
I have an attempted solution as one of the answers below, I would like to understand if I am doing it in the right (pythonic, fastest) way
Upvotes: 0
Views: 55
Reputation: 20669
You can use SeriesGroupBy.agg
with pd.Series.dropna
with set
.
To delete ''
just use set difference here i.e {1,2,3} - {3} = {1,2}
. So, set(x.dropna())-{''}
would give a set without ''
after dropping all missing values.
df.groupby('Deal')['Product'].agg(lambda x:set(x.dropna())-{''})
Deal
001 {Apple}
002 {Strawberry}
003 {Strawberry, Apple}
Name: Product, dtype: object
Upvotes: 1
Reputation: 22503
You can use notnull
:
print (df.loc[df['Product'].notnull()].groupby("Deal")["Product"].apply(set))
Deal
001 {Apple}
002 {Strawberry}
003 {Strawberry, Apple}
Use df.loc[~df['Product'].isin([None,""])]...
if you want to take care of both None
and ''
. Or use the methods from other answers for filtering.
Upvotes: 2
Reputation: 862681
Idea is remove missing values with chained Series.notna
with test no empty strings by &
for bitwise AND
:
mask = df['Product'].notna() & df['Product'].ne('')
df = df[mask].groupby('Deal')['Product'].agg(set).reset_index()
print (df)
Deal Product
0 001 {Apple}
1 002 {Strawberry}
2 003 {Apple, Strawberry}
Upvotes: 2
Reputation: 784
I started from this answer to solve my problem
# Turn every element in product in a set of one or zero elements
# Also Ensure we don't have null values
df["Product"] = df["Product"].apply(lambda val: {val} if val not in [None, ""] else {})
#Use the answer mentioned to bring together the single sets
df.groupby('Deal').agg({'Product':lambda x: set.union(*x)}).reset_index('Deal')
Final result:
>>> df
Deal Product
0 001 {Apple}
1 002 {Strawberry}
2 003 {Strawberry, Apple}
Upvotes: 0