Reputation: 23
Suppose I have 10 rows and 3 columns. Each row has exactly one non-zero value, and the column it appears in differs from row to row.
How can I iterate over each row, find the non-zero value, check its column index, and store the corresponding column name in memory?
I tried a couple of approaches with the pandas library, but didn't find a concise way.
Raw data:
index A B C D
11 0 3 0 0
136 0 0 0 1
186 0 0 184 0
258 0 0 15 0
455 0 1 0 0
628 0 0 1 0
774 0 0 2 0
829 0 0 1 0
1004 0 0 113 0
1008 0 0 1 0
Example of one of the expected outputs for 3 different indexes (136, 455, 1008):
['D', 'B', 'C']
Of course the output could be anything else like list of lists. But this case specific example is enough for me.
I found that:
memory_var = [row.max() for _, row in df.iterrows()]  # iterrows() yields (index, Series) pairs
print(memory_var)
This gives me a list containing only the highest value of each row. Now I need the column names of those values, but chaining the "idxmax()" method after max() returns an error:
'numpy.float64' object has no attribute 'idxmax'
Update:
Found the correct solution for my case. The extraction of a specific column could be done by:
row[:].sort_values().idxmax()
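For what it's worth, the sorting step isn't needed: idxmax already returns the label of the row's maximum. A minimal sketch on a made-up two-row frame mimicking the question's shape:

```python
import pandas as pd

# toy frame: exactly one non-zero value per row
df = pd.DataFrame({'A': [0, 0], 'B': [3, 0], 'C': [0, 0], 'D': [0, 1]},
                  index=[11, 136])

# idxmax() on each row returns the label of the (single) non-zero column
names = [row.idxmax() for _, row in df.iterrows()]
print(names)  # → ['B', 'D']
```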
Upvotes: 2
Views: 3572
Reputation: 3001
idxmax is perfect for this. The original data includes column D, so I used it, too:
from io import StringIO
import pandas as pd
data = '''index A B C D
11 0 3 0 0
136 0 0 0 1
186 0 0 184 0
258 0 0 15 0
455 0 1 0 0
628 0 0 1 0
774 0 0 2 0
829 0 0 1 0
1004 0 0 113 0
1008 0 0 1 0
'''
df = (pd.read_csv(StringIO(data), sep=r'\s+', engine='python')
      .set_index('index'))
# apply idxmax and show result
df['col'] = df.apply(lambda x: x.idxmax(), axis=1)
print(df)
A B C D col
index
11 0 3 0 0 B
136 0 0 0 1 D
186 0 0 184 0 C
258 0 0 15 0 C
455 0 1 0 0 B
628 0 0 1 0 C
774 0 0 2 0 C
829 0 0 1 0 C
1004 0 0 113 0 C
1008 0 0 1 0 C
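As a side note, the apply wrapper can be dropped: idxmax accepts axis=1 directly and computes the whole column in one vectorized call. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0], 'B': [3, 0, 1], 'C': [0, 184, 0]},
                  index=[11, 186, 455])

# idxmax(axis=1) returns, per row, the column label holding the row maximum
col = df.idxmax(axis=1)
print(col.tolist())  # → ['B', 'C', 'B']
```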
Upvotes: 4
Reputation: 195
You can make use of the DataFrame's apply function.
Iterate over each row, find the position of the row's non-zero value, and look the name up in df.columns.
import pandas as pd

data = [[0, 0, 1], [0, 2, 0], [3, 0, 0], [0, 0, 4], [0, 5, 0], [6, 0, 0], [0, 0, 7], [0, 8, 0], [9, 0, 0], [0, 0, 10]]
df = pd.DataFrame(data, columns=["a", "b", "c"])
# for each row, take the first position holding a non-zero value and map it to a column name
df["d"] = df.apply(lambda row: df.columns[[i for i in range(len(row)) if row.iloc[i] > 0][0]], axis=1)
a b c d
0 0 0 1 c
1 0 2 0 b
2 3 0 0 a
3 0 0 4 c
4 0 5 0 b
5 6 0 0 a
6 0 0 7 c
7 0 8 0 b
8 9 0 0 a
9 0 0 10 c
Update: To get the names of the columns with non-zero values:
column_list = df["d"].tolist()
['c', 'b', 'a', 'c', 'b', 'a', 'c', 'b', 'a', 'c']
To only get the names of the columns at the specific locations:
df["d"].iloc[[1, 3, 5, 7]].tolist()
['b', 'c', 'a', 'b']
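If the per-row list comprehension ever becomes a bottleneck, the same lookup can be sketched in vectorized form (assuming, as in the question, exactly one non-zero per row) with ne(0) plus idxmax:

```python
import pandas as pd

data = [[0, 0, 1], [0, 2, 0], [3, 0, 0]]
df = pd.DataFrame(data, columns=["a", "b", "c"])

# ne(0) marks non-zero cells True; idxmax(axis=1) picks the first True label
df["d"] = df.ne(0).idxmax(axis=1)
print(df["d"].tolist())  # → ['c', 'b', 'a']
```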
Upvotes: 1
Reputation: 146
An alternate approach that ends up adding a column at the end of the original DataFrame containing the non-zero column names:
import pandas as pd

df = pd.DataFrame({'x': [0, 0, 9], 'y': [-1, 0, 0], 'z': [0, 1.1, 0]},
                  index=['a', 'b', 'c'])
df["non_zero_columns"] = (df.where(df == 0, other=df.apply(lambda x: x.name), axis=1)
                            .where(df != 0, other="")
                            .apply(lambda row: ''.join(row.values), axis=1))
'df' is now:
x y z non_zero_columns
a 0 -1 0.0 y
b 0 0 1.1 z
c 9 0 0.0 x
Breaking apart the one-liner, the two where() calls set zeros to empty strings and non-zero values to the name of the column they're in. These values (empty strings and column names) are string concatenated in the apply() to create the "non_zero_columns" column.
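The same concatenation can be sketched more compactly by dotting a boolean mask against the column labels (True * 'y' is 'y', False * 'y' is ''), though the chained where() version above makes each step explicit:

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 0, 9], 'y': [-1, 0, 0], 'z': [0, 1.1, 0]},
                  index=['a', 'b', 'c'])

# dot() sums (string-concatenates) column labels wherever the mask is True
df["non_zero_columns"] = df.ne(0).dot(df.columns)
print(df["non_zero_columns"].tolist())  # → ['y', 'z', 'x']
```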
Upvotes: 1
Reputation: 308
Is this what you're looking for?
>>> import numpy as np
>>> import pandas as pd
>>> a_df = pd.DataFrame({"a":[1,0,0,0], "b":[0,3,0,0], "c":[0,0,0,7], "d":[0,0,0,0]})
>>> a_df
a b c d
0 1 0 0 0
1 0 3 0 0
2 0 0 0 0
3 0 0 7 0
>>> a_df = a_df.replace(0, np.nan)
>>> a_df
a b c d
0 1.0 NaN NaN NaN
1 NaN 3.0 NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN 7.0 NaN
>>> a_df.dropna(how="all", axis=1).columns.tolist()
['a', 'b', 'c']
If you want to preserve the original df, chain the calls instead of reassigning:
>>> a_df.replace(0, np.nan).dropna(how="all", axis=1).columns.tolist()
['a', 'b', 'c']
>>> a_df
a b c d
0 1 0 0 0
1 0 3 0 0
2 0 0 0 0
3 0 0 7 0
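A variant of the same idea that skips the NaN round-trip is to boolean-mask the columns directly with any() (the small frame here is made up):

```python
import pandas as pd

a_df = pd.DataFrame({"a": [1, 0], "b": [0, 3], "c": [0, 0]})

# keep only the columns that contain at least one non-zero value
cols = a_df.columns[a_df.ne(0).any()].tolist()
print(cols)  # → ['a', 'b']
```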
Upvotes: 1