Reputation: 99
I have a dataframe like:
df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'vote':[5,4,5,1,10,1,9],
'doggo': [None,"doggo",None,None,"doggo",None,None],
'floofer': ["floofer",None,None,"floofer",None,None,None],
'pupper': [None,None,"pupper",None,None,None,None],
'puppo':[None,None,None,None,None,None,"puppo"]})
I want to combine last 4 columns and gnerate:
df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'vote':[5,4,5,1,10,1,9],
'categories': ["floofer","doggo","pupper","floofer","doggo",None,"puppo"]})
Any guidance is appreciated.
Upvotes: 3
Views: 97
Reputation: 862511
Solution if each row has only one not None
value per categorical columns:
cols = ['doggo','floofer','pupper','puppo']
cols1 = df.columns.difference(cols)
df2 = df[cols1].join(df[cols].ffill(axis=1).iloc[:, -1].rename('Categories'))
print (df2)
id vote Categories
0 1 5 floofer
1 2 4 doggo
2 3 5 pupper
3 4 1 floofer
4 5 10 doggo
5 6 1 None
6 7 9 puppo
Explanation:
First select only columns with catagorical data and forward fill missing values - expected data are in last column:
print (df[cols].ffill(axis=1))
doggo floofer pupper puppo
0 None floofer floofer floofer
1 doggo doggo doggo doggo
2 None None pupper pupper
3 None floofer floofer floofer
4 doggo doggo doggo doggo
5 None None None None
6 None None None puppo
Select last column by position:
print (df[cols].ffill(axis=1).iloc[:, -1])
0 floofer
1 doggo
2 pupper
3 floofer
4 doggo
5 None
6 puppo
Name: puppo, dtype: object
Solution if multiple values - data are created from columns names of categorical columns:
df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'vote':[5,4,5,1,10,1,9],
'doggo': [None,"doggo1",None,"doggo2","doggo3",None,None],
'floofer': ["floofer1",None,None,"floofer2",None,None,None],
'pupper': [None,None,"pupper1",None,None,None,None],
'puppo':["puppo1",None,None,None,None,None,"puppo2"]})
print (df)
id vote doggo floofer pupper puppo
0 1 5 None floofer1 None puppo1
1 2 4 doggo1 None None None
2 3 5 None None pupper1 None
3 4 1 doggo2 floofer2 None None
4 5 10 doggo3 None None None
5 6 1 None None None None
6 7 9 None None None puppo2
s = (df[cols].notnull()
.dot(pd.Index(cols) + ', ')
.str.strip(', ')
.rename('Categories')
.replace('', np.nan)
)
df = df[cols1].join(s)
print (df)
id vote Categories
0 1 5 floofer, puppo
1 2 4 doggo
2 3 5 pupper
3 4 1 doggo, floofer
4 5 10 doggo
5 6 1 NaN
6 7 9 puppo
Another solution, expected output is not from columns names:
s = pd.Series(df[cols].add(', ').fillna('').values.sum(axis=1),
index=df.index, name='Categories').str.strip(', ')
df = df[cols1].join(s)
print (df)
id vote Categories
0 1 5 floofer1, puppo1
1 2 4 doggo1
2 3 5 pupper1
3 4 1 doggo2, floofer2
4 5 10 doggo3
5 6 1
6 7 9 puppo2
Upvotes: 1
Reputation: 10211
We can exploit the fact that x or None
is x
and use Numpy's logical_or operator reduced/applied along each row to get the category.
import numpy as np
cols = ['doggo','floofer','pupper','puppo']
categories = np.logical_or.reduce(df[cols], axis=1)
df = df.assign(categories=categories).drop(cols, axis=1)
Upvotes: 0
Reputation: 402303
bfill
+ iloc
You can bfill
(backfill) and select the first column:
(df.set_index(['id', 'vote'])
.bfill(axis=1)
.iloc[:, 0]
.reset_index(name='Categories'))
id vote Categories
0 1 5 floofer
1 2 4 doggo
2 3 5 pupper
3 4 1 floofer
4 5 10 doggo
5 6 1 None
6 7 9 puppo
stack
+ reindex
cats = (df.drop(['id', 'vote'], 1).stack()
.reset_index(level=1, drop=True).reindex(df.index))
pd.DataFrame(dict(id=df.id, vote=df.vote, Categories=cats))
id vote Categories
0 1 5 floofer
1 2 4 doggo
2 3 5 pupper
3 4 1 floofer
4 5 10 doggo
5 6 1 NaN
6 7 9 puppo
last_valid_index
Slow, but succinct.
(df.set_index(['id', 'vote'])
.agg(lambda x: x.last_valid_index(), axis=1)
.reset_index(name='Categories'))
id vote Categories
0 1 5 floofer
1 2 4 doggo
2 3 5 pupper
3 4 1 floofer
4 5 10 doggo
5 6 1 None
6 7 9 puppo
Assuming "id" and "vote" are the only non-categorical columns.
Upvotes: 1