vikingd
vikingd

Reputation: 43

Combine values from multiple columns into a list in each row using python

I have the following dataframe.

data = [[0, 0, 0, 0, 1], [0, 1, 0, 0, 1], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0]]
labels = ['cat', 'dog', 'duck', 'fish', 'horse']
df = pd.DataFrame(data, columns = labels)
df: 
    cat dog duck fish horse
0    0   0   0    0    1
1    0   1   0    0    1
2    1   1   0    0    0
3    0   1   0    0    0

I have got the 0's and 1's from another dataframe based on a certain condition. I want to combine the column names for values corresponding to true values. i.e 1's into a list and put it into a new column at the end of the dataframe.

I want my result to look like this.

   cat  dog  duck  fish horse   result
0   0    0    0     0    1     [horse]
1   0    1    0     0    1     [dog, horse]
2   1    1    0     0    0     [cat, dog]
3   0    1    0     0    0     [dog] 

I have about 108 columns in my original dataframe and around 3500 rows. What is the best way to do this?

PS: I haven't been successful in finding a way to do this.

Thanks

Upvotes: 0

Views: 1193

Answers (2)

jezrael
jezrael

Reputation: 862581

Filter columns names if 1 in values per rows:

c = df.columns.to_numpy()
df['result'] = df.apply(lambda x: list(c[x == 1]), axis=1)

Alternative is faster:

c = df.columns.to_numpy()
df['result'] = [c[x == 1] for x in df.to_numpy()]

print (df)
   cat  dog  duck  fish  horse        result
0    0    0     0     0      1       [horse]
1    0    1     0     0      1  [dog, horse]
2    1    1     0     0      0    [cat, dog]
3    0    1     0     0      0         [dog]

#[4000 rows x 100 columns]
df = pd.concat([df] * 1000, ignore_index=True)
df = pd.concat([df] * 20, ignore_index=True, axis=1).add_prefix('test')
    

In [38]: %%timeit
    ...: c = df.columns.to_numpy()
    ...: df.apply(lambda x: list(c[x == 1]), axis=1)
    ...: 
    ...: 
526 ms ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [62]: %%timeit
    ...: c = df.columns.to_numpy()
    ...: [c[x == 1] for x in df.to_numpy()]
    ...: 
    ...: 
12.1 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Anothter solutions:

In [58]: %%timeit
    ...: df.dot(df.columns + ',').str.strip(',').str.split(',')
    ...: 
50.7 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [63]: %%timeit
    ...: df.mask(df.eq(0)).stack().reset_index(-1).groupby(level=0).agg({'level_1' : list }).values
    ...: 
    ...: 
162 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 1

Nk03
Nk03

Reputation: 14949

You can try:

df['result'] = df.dot(df.columns + ',').str.strip(',').str.split(',')

Alternative:

df['result'] = df.mask(df.eq(0)).stack().reset_index(-1).groupby(level=0).agg({'level_1' : list }).values

OUTPUT:

   cat  dog  duck  fish  horse        result
0    0    0     0     0      1       [horse]
1    0    1     0     0      1  [dog, horse]
2    1    1     0     0      0    [cat, dog]
3    0    1     0     0      0         [dog]

Upvotes: 0

Related Questions