Reputation: 113
I have a dataframe looks like this:
import pandas as pd
df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
'type_b': [0,1,0,0,0,0,0,0,1,1],
'type_c': [0,0,1,1,1,1,0,0,0,0],
'type_d': [1,0,0,0,0,1,1,0,1,0],
})
I wanna create a new column based on those 4 columns, it will return the column names whenever the value in those 4 columns equals to 1, if there are multiple columns equal to 1 at the same time then it will return the list of those columns names, otherwise it will be nan.
The output dataframe will look like this:
df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
'type_b': [0,1,0,0,0,0,0,0,1,1],
'type_c': [0,0,1,1,1,1,0,0,0,0],
'type_d': [1,0,0,0,0,1,1,0,1,0],
'type':[['type_a','type_d'], 'type_b', 'type_c', 'type_c','type_c', ['type_a','type_c','type_d'], 'type_d', 'nan', ['type_b','type_d'],['type_a','type_b']]
})
Any help will be really appreciated. Thanks!
Upvotes: 3
Views: 119
Reputation: 21938
This is also another way:
import pandas as pd
df['type'] = (pd.melt(df.reset_index(), id_vars='index')
.query('value == 1')
.groupby('index')['variable']
.apply(list))
type_a type_b type_c type_d type
0 1 0 0 1 [type_a, type_d]
1 0 1 0 0 [type_b]
2 0 0 1 0 [type_c]
3 0 0 1 0 [type_c]
4 0 0 1 0 [type_c]
5 1 0 1 1 [type_a, type_c, type_d]
6 0 0 0 1 [type_d]
7 0 0 0 0 NaN
8 0 1 0 1 [type_b, type_d]
9 1 1 0 0 [type_a, type_b]
Upvotes: 4
Reputation: 13447
You can use this answer and adapt to your case.
import pandas as pd
df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
'type_b': [0,1,0,0,0,0,0,0,1,1],
'type_c': [0,0,1,1,1,1,0,0,0,0],
'type_d': [1,0,0,0,0,1,1,0,1,0],
})
df['type'] = df.dot(df.columns + ',')\
.str.rstrip(',')\
.apply(lambda x: x.split(','))
Where the output is
type_a type_b type_c type_d type
0 1 0 0 1 [type_a, type_d]
1 0 1 0 0 [type_b]
2 0 0 1 0 [type_c]
3 0 0 1 0 [type_c]
4 0 0 1 0 [type_c]
5 1 0 1 1 [type_a, type_c, type_d]
6 0 0 0 1 [type_d]
7 0 0 0 0 []
8 0 1 0 1 [type_b, type_d]
9 1 1 0 0 [type_a, type_b]
The general case will be
df['type'] = df.eq(1).dot(df.columns + ',')\
.str.rstrip(',')\
.apply(lambda x: x.split(','))
Eventually you can avoid lambda (in case your dataframe is big)
df['type'] = df.eq(1).dot(df.columns + ',')\
.str.rstrip(',')\
.str.split(',')
Here I want to compare few solutions proposed here.
import pandas as pd
import numpy as np
n = 10_000
columns = ['type_a', 'type_b', 'type_c', 'type_d']
# set seed for reproducibility
np.random.seed(0)
df = pd.DataFrame(
np.random.randint(2, size=(n, 4)),
columns=columns)
# save copy of original data
df_bk = df.copy()
As we are going to load the data using timeit
we want to know how long it takes.
%%timeit -n 10 -r 10
df = df_bk.copy()
142 µs ± 40.4 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = df.apply(lambda x:
df.columns[x.eq(1)].tolist(), axis=1)
782 ms ± 33.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = df.mul(df.columns)\
.apply(lambda x: list(pd.Series(i for i in x if len(i)>0)), axis=1)
619 ms ± 22.1 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = (pd.melt(df.reset_index(), id_vars='index')
.query('value == 1')
.groupby('index')['variable']
.apply(lambda x:[str for str in x]))
148 ms ± 12.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%%timeit -n 10 -r 10
df = df_bk.copy()
df['type'] = df.eq(1).dot(df.columns + ',')\
.str.rstrip(',')\
.str.split(',')
13 ms ± 2.61 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
As you can see from the following image (please click it to expand) the accepted solution is ways faster than others. Yet the vectorial solution suggested here manages to be 11x faster than the accepted solution.
Upvotes: 2
Reputation: 598
And another one:
import pandas as pd
import numpy as np
df = pd.DataFrame({'type_a': [1,0,0,0,0,1,0,0,0,1],
'type_b': [0,1,0,0,0,0,0,0,1,1],
'type_c': [0,0,1,1,1,1,0,0,0,0],
'type_d': [1,0,0,0,0,1,1,0,1,0],
})
df['type']=''
for i,r in df.iterrows():
t=[k for k in r.keys() if r[k]==1]
if t:
if len(t)==1:
df.at[i,'type']=t[0]
else:
df.at[i,'type']=t
else:
df.at[i,'type']=np.nan
Upvotes: 0
Reputation: 11650
here is one more way
df.mul(df.columns).apply(lambda x: list(pd.Series(i for i in x if len(i)>0)), axis=1)
0 [type_a, type_d]
1 [type_b]
2 [type_c]
3 [type_c]
4 [type_c]
5 [type_a, type_c, type_d]
6 [type_d]
7 []
8 [type_b, type_d]
9 [type_a, type_b]
Upvotes: 0