Charming Imp

Reputation: 23

Python: Get column name by non zero value in row

Suppose I have 10 rows and 3 columns. Each row has exactly one non-zero value, and the column it sits in differs from row to row.

How can I iterate over the rows, find the non-zero value in each one, and store the name of the column it belongs to?

I tried a couple of approaches using the pandas library, but didn't find a concise one.

Raw data:

index  A  B    C  D                 
11     0  3    0  0     
136    0  0    0  1     
186    0  0  184  0     
258    0  0   15  0     
455    0  1    0  0     
628    0  0    1  0    
774    0  0    2  0     
829    0  0    1  0     
1004   0  0  113  0     
1008   0  0    1  0     

Example of one of the expected outputs for 3 different indexes (136, 455, 1008):

['D', 'B', 'C']

The output could take another form, such as a list of lists, but this specific example is enough for me.

I found this:

# iterrows() yields (index, row) pairs; collect the largest value of each row
memory_var = [row.max() for _, row in df.iterrows()]
print(memory_var)

This gives me a list with only the highest value of each row. Now I need the column names of those values. Calling idxmax() on them raises an error:

'numpy.float64' object has no attribute 'idxmax'

Update:

I found a solution that works for my case. The column name for a given row can be extracted with:

row[:].sort_values().idxmax()
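
For the three indexes from the example, this can be sketched as follows. It is a minimal sketch, assuming the raw data above has been loaded into a DataFrame named df with the index column as its index; the names wanted and names are introduced here only for illustration.

```python
import pandas as pd

# Raw data from the question, indexed by the "index" column.
df = pd.DataFrame(
    {
        "A": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        "B": [3, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        "C": [0, 0, 184, 15, 0, 1, 2, 1, 113, 1],
        "D": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    },
    index=[11, 136, 186, 258, 455, 628, 774, 829, 1004, 1008],
)

# idxmax is a Series/DataFrame method; the scalar returned by max() lacks it,
# which is why the earlier attempt raised an AttributeError.
wanted = [136, 455, 1008]
names = df.loc[wanted].idxmax(axis=1).tolist()
print(names)  # ['D', 'B', 'C']
```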

Upvotes: 2

Views: 3572

Answers (4)

jsmart

Reputation: 3001

idxmax is perfect for this. The original data includes column D, so I used it too:

from io import StringIO
import pandas as pd

data = '''index  A  B    C  D                 
11     0  3    0  0     
136    0  0    0  1     
186    0  0  184  0     
258    0  0   15  0     
455    0  1    0  0     
628    0  0    1  0    
774    0  0    2  0     
829    0  0    1  0     
1004   0  0  113  0     
1008   0  0    1  0   
'''
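# raw string avoids an invalid-escape warning for the regex separator
df = (pd.read_csv(StringIO(data), sep=r'\s\s+', engine='python')
        .set_index('index'))
# idxmax(axis=1) returns, for each row, the label of the column holding the row maximum
df['col'] = df.idxmax(axis=1)
print(df)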

       A  B    C  D col
index                  
11     0  3    0  0   B
136    0  0    0  1   D
186    0  0  184  0   C
258    0  0   15  0   C
455    0  1    0  0   B
628    0  0    1  0   C
774    0  0    2  0   C
829    0  0    1  0   C
1004   0  0  113  0   C
1008   0  0    1  0   C
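
One caveat: idxmax looks for the row maximum, so an all-zero row still gets a label (the first column), and a row whose only non-zero value is negative gets the wrong one. Neither case occurs in the question's data. A minimal sketch that tests for non-zero values instead, using a small hypothetical frame named demo:

```python
import pandas as pd

# Hypothetical data: row "r2" is all zeros, row "r3" holds a negative value.
demo = pd.DataFrame(
    {"A": [0, 0, 0], "B": [3, 0, -2], "C": [0, 0, 0]},
    index=["r1", "r2", "r3"],
)

nonzero = demo.ne(0)                   # True where a value is non-zero
col = nonzero.idxmax(axis=1)           # label of the first True in each row
col = col.where(nonzero.any(axis=1))   # rows without any non-zero value become missing
print(col)
```

Here r1 and r3 both map to B, and r2 is left as a missing value rather than being labelled A.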

Upvotes: 4

NullPointer

Reputation: 195

You can use the DataFrame's apply function with axis=1.

Iterate over the rows and find the position in each row where the non-zero value sits. Use df.columns to turn that position into a column name.

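import pandas as pd

data = [[0, 0, 1], [0, 2, 0], [3, 0, 0], [0, 0, 4], [0, 5, 0], [6, 0, 0], [0, 0, 7], [0, 8, 0], [9, 0, 0], [0, 0, 10]]

df = pd.DataFrame(data, columns=["a", "b", "c"])

# For each row, collect the positions of the non-zero values and map the first one to its column name.
# row.iloc[i] is positional; != 0 also catches negative values.
df["d"] = df.apply(lambda row: df.columns[[i for i in range(len(row)) if row.iloc[i] != 0][0]], axis=1)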

   a  b   c  d
0  0  0   1  c
1  0  2   0  b
2  3  0   0  a
3  0  0   4  c
4  0  5   0  b
5  6  0   0  a
6  0  0   7  c
7  0  8   0  b
8  9  0   0  a
9  0  0  10  c

Update: To get the names of the columns with non-zero values:

column_list = df["d"].tolist()
['c', 'b', 'a', 'c', 'b', 'a', 'c', 'b', 'a', 'c']

To only get the names of the columns at the specific locations:

df["d"].iloc[[1, 3, 5, 7]].tolist()
['b', 'c', 'a', 'b']

Upvotes: 1

Josh

Reputation: 146

An alternate approach that ends up adding a column at the end of the original DataFrame containing the non-zero column names:

import pandas as pd

df = pd.DataFrame({'x' : [ 0,0,9],'y' : [-1,0,0],'z' : [0,1.1,0]},index=['a','b','c'])
df["non_zero_columns"] = df.where(df == 0, 
                                  other=df.apply(lambda x: x.name), 
                                  axis=1).where(df != 0, 
                                                other="").apply(lambda row: ''.join(row.values), axis=1)

'df' is now:

    x   y   z   non_zero_columns
a   0   -1  0.0 y
b   0   0   1.1 z
c   9   0   0.0 x

Breaking apart the one-liner: the first where() call replaces each non-zero value with the name of the column it is in, and the second replaces the remaining zeros with empty strings. The apply() then concatenates each row's strings (empty strings and column names) to create the "non_zero_columns" column.
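
The same result can probably be reached more directly by filtering each row's labels. This is a minimal sketch of that alternative, not the one-liner itself; the name non_zero is introduced here only for illustration.

```python
import pandas as pd

# Same frame as in the answer above.
df = pd.DataFrame({'x': [0, 0, 9], 'y': [-1, 0, 0], 'z': [0, 1.1, 0]},
                  index=['a', 'b', 'c'])

# For each row, keep the labels whose value is non-zero and join them into one string.
non_zero = df.apply(lambda row: ''.join(row[row != 0].index), axis=1)
print(non_zero.tolist())  # ['y', 'z', 'x']
```

If a row held several non-zero values, their column names would be concatenated in column order, as with the where() version.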

Upvotes: 1

PiyushC

Reputation: 308

Is this what you're looking for?

>>> import numpy as np
>>> import pandas as pd
>>> a_df = pd.DataFrame({"a":[1,0,0,0], "b":[0,3,0,0], "c":[0,0,0,7], "d":[0,0,0,0]})
>>> a_df
   a  b  c  d
0  1  0  0  0
1  0  3  0  0
2  0  0  0  0
3  0  0  7  0

>>> a_df = a_df.replace(0, np.nan)
>>> a_df
     a    b    c   d
0  1.0  NaN  NaN NaN
1  NaN  3.0  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  NaN  7.0 NaN

>>> a_df.dropna(how="all", axis=1).columns.tolist()
['a', 'b', 'c']

If you want to preserve the original df, chain the calls instead of reassigning a_df:

>>> a_df.replace(0, np.nan).dropna(how="all", axis=1).columns.tolist()
['a', 'b', 'c']

>>> a_df
   a  b  c  d
0  1  0  0  0
1  0  3  0  0
2  0  0  0  0
3  0  0  7  0
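
The NaN round trip can likely be skipped with a boolean mask. A minimal sketch, assuming the same a_df as above; has_nonzero and cols are names introduced here for illustration.

```python
import pandas as pd

a_df = pd.DataFrame({"a": [1, 0, 0, 0], "b": [0, 3, 0, 0],
                     "c": [0, 0, 0, 7], "d": [0, 0, 0, 0]})

# True for each column that holds at least one non-zero value.
has_nonzero = (a_df != 0).any(axis=0)

# Select those columns by boolean mask; a_df itself is left unchanged.
cols = a_df.loc[:, has_nonzero].columns.tolist()
print(cols)  # ['a', 'b', 'c']
```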

Upvotes: 1
