Reputation: 1758
I have a problem for which I managed to write some working code, but I'd like to see if anyone here could have a simpler / more organized / less ugly / more in-built solution. Sorry for the extremely vague title, but I wasn't able to summarize the issue in one sentence.
The problem
Basically I have a DataFrame that looks like this:
id foo_col A B C D
0 x nothing 2 0 1 1
1 y to 0 0 3 2
2 z see 1 3 2 2
Now I'd like to transform the columns ['A', 'B', 'C', 'D']
into ['W1', 'W2', 'W3']
, that would be the first 3 column names (per row) sorted using the numbers in each row.
In this way, the row with id x
would have A
(with 2), C
(with 1), D
(with 1), B
(with 0), thus getting 'W1' = 'A'
, 'W2' = 'C'
, 'W3' = 'D'
.
The target DataFrame would then look like this:
id foo_col W1 W2 W3
0 x nothing A C D
1 y to C D None
2 z see B C D
The rules
x
);W
s, the ones that are missing will get None
(row y
);W
s, the extra one will not make it in the final DataFrame (row z
).The solution
import pandas as pd
import operator
import more_itertools as mit
# Define starting DataFrame
df = pd.DataFrame(data={'id': ['x', 'y', 'z'],
'foo_col': ['nothing', 'to', 'see'],
'A': [2, 0, 1],
'B': [0, 0, 3],
'C': [1, 3, 2],
'D': [1, 2, 2]})
print('Original DataFrame')
print(df.to_string())
print()
# Define 'source' and 'target' columns
w_columns = ['A', 'B', 'C', 'D']
w_labels = ['W1', 'W2', 'W3']
# Define function to do this pivoting
def pivot_w(row, columns=w_columns, labels=w_labels):
# Convert relevant columns of DF to dictionary
row_dict = row[columns].to_dict()
# Convert dictionary to list of tuples
row_tuples = [tuple(d) for d in row_dict.items()]
# Sort list of tuples based on the second item (the value in the cell)
row_tuples.sort(key=operator.itemgetter(1), reverse=True)
# Get the sorted 'column' labels
row_list = [x[0] for x in row_tuples if x[1] != 0]
# Enforce rules 2 and 3
if len(row_list) < 3:
row_list = list(mit.take(3, mit.padnone(row_list)))
else:
row_list = row_list[:3]
# Create a dictionary using the W lables
output = {i: j for i, j in zip(labels, row_list)}
return output
# Get DataFrame with W columns and index
df_w = pd.DataFrame(list(df.apply(pivot_w, axis=1)))
# Merge DataFrames on index
df = df.merge(df_w, how='inner', left_index=True, right_index=True)
# Drop A, B, C, D columns
df.drop(columns=w_columns, inplace=True)
print('Final DataFrame')
print(df.to_string())
Aside for maybe re-using the same variable to store the in-between results in the function, is there anything smarter I could do?
P.S. If anyone of you has an idea for a better/clearer title please feel free to edit!
Upvotes: 4
Views: 332
Reputation: 75080
here is one way:
l=['W1', 'W2', 'W3']
m=df.set_index(['id','foo_col'])
m=(m.replace(0,np.nan).apply(lambda x: x.nlargest(3),axis=1).notna().dot(m.columns+',')
.str[:-1].str.split(',',expand=True))
m.columns=l
m.reset_index()
id foo_col W1 W2 W3
0 x nothing A C D
1 y to C D None
2 z see B C D
Upvotes: 2
Reputation: 2757
(df[['A','B','C','D']]
.stack()
.loc[lambda x:x!=0]
.reset_index()
.sort_values(by=['level_0',0], ascending=False)
.groupby('level_0').apply(lambda x:x.reset_index())['level_1']
.reindex([0,1,2],level=1)
.rename(lambda x:'W'+str(x+1),level=1)
.unstack())
Upvotes: 0
Reputation: 862761
You can use argsort for get top3 columns names, but then is necessary replace positions from 0
values with sorting and np.where
:
w_columns = ['A', 'B', 'C', 'D']
w_labels = ['W1', 'W2', 'W3']
#sorting columns names by values, last are 0 values (because minimal)
arr = np.array(w_columns)[np.argsort(-a, axis=1)]
print (arr)
[['A' 'C' 'D' 'B']
['C' 'D' 'A' 'B']
['B' 'C' 'D' 'A']]
#sorting values for 0 to last positions and compare by 0
mask = -np.sort(-df[w_columns], axis=1) == 0
print (mask)
[[False False False True]
[False False True True]
[False False False False]]
#replace first 3 'columns' by mask to None
out = np.where(mask[:, :3], None, arr[:, :3])
print (out)
[['A' 'C' 'D']
['C' 'D' None]
['B' 'C' 'D']]
df1 = pd.DataFrame(out, columns=w_labels, index=df.index)
print (df1)
W1 W2 W3
0 A C D
1 C D None
2 B C D
df = df.drop(w_columns, 1).join(df1)
print (df)
id foo_col W1 W2 W3
0 x nothing A C D
1 y to C D None
2 z see B C D
If possible need exclude some another value whic is not minimal in all seelcted values is possible repalce it to NaN
s and for test use np.isnan
:
a = np.where(df[w_columns] != 0, df[w_columns], np.nan)
print (a)
[[ 2. nan 1. 1.]
[nan nan 3. 2.]
[ 1. 3. 2. 2.]]
arr = np.array(w_columns)[np.argsort(-a, axis=1)]
mask = np.isnan(np.sort(a, axis=1))
out = np.where(mask[:, :3], None, arr[:, :3])
print (out)
[['A' 'C' 'D']
['C' 'D' None]
['B' 'C' 'D']]
Upvotes: 3