Reputation: 43
I have the following dataframe.
data = [[0, 0, 0, 0, 1], [0, 1, 0, 0, 1], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0]]
labels = ['cat', 'dog', 'duck', 'fish', 'horse']
df = pd.DataFrame(data, columns = labels)
df:
cat dog duck fish horse
0 0 0 0 0 1
1 0 1 0 0 1
2 1 1 0 0 0
3 0 1 0 0 0
I have got the 0's and 1's from another dataframe based on a certain condition. I want to combine the column names for values corresponding to true values. i.e 1's into a list and put it into a new column at the end of the dataframe.
I want my result to look like this.
cat dog duck fish horse result
0 0 0 0 0 1 [horse]
1 0 1 0 0 1 [dog, horse]
2 1 1 0 0 0 [cat, dog]
3 0 1 0 0 0 [dog]
I have about 108 columns in my original dataframe and around 3500 rows. What is the best way to do this?
PS: I haven't been successful in finding a way to do this.
Thanks
Upvotes: 0
Views: 1193
Reputation: 862581
Filter columns names if 1
in values per rows:
c = df.columns.to_numpy()
df['result'] = df.apply(lambda x: list(c[x == 1]), axis=1)
Alternative is faster:
c = df.columns.to_numpy()
df['result'] = [c[x == 1] for x in df.to_numpy()]
print (df)
cat dog duck fish horse result
0 0 0 0 0 1 [horse]
1 0 1 0 0 1 [dog, horse]
2 1 1 0 0 0 [cat, dog]
3 0 1 0 0 0 [dog]
#[4000 rows x 100 columns]
df = pd.concat([df] * 1000, ignore_index=True)
df = pd.concat([df] * 20, ignore_index=True, axis=1).add_prefix('test')
In [38]: %%timeit
...: c = df.columns.to_numpy()
...: df.apply(lambda x: list(c[x == 1]), axis=1)
...:
...:
526 ms ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [62]: %%timeit
...: c = df.columns.to_numpy()
...: [c[x == 1] for x in df.to_numpy()]
...:
...:
12.1 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Anothter solutions:
In [58]: %%timeit
...: df.dot(df.columns + ',').str.strip(',').str.split(',')
...:
50.7 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [63]: %%timeit
...: df.mask(df.eq(0)).stack().reset_index(-1).groupby(level=0).agg({'level_1' : list }).values
...:
...:
162 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1
Reputation: 14949
You can try:
df['result'] = df.dot(df.columns + ',').str.strip(',').str.split(',')
Alternative:
df['result'] = df.mask(df.eq(0)).stack().reset_index(-1).groupby(level=0).agg({'level_1' : list }).values
cat dog duck fish horse result
0 0 0 0 0 1 [horse]
1 0 1 0 0 1 [dog, horse]
2 1 1 0 0 0 [cat, dog]
3 0 1 0 0 0 [dog]
Upvotes: 0