Reputation: 542
I am trying to apply the following function for each row in a dataframe. The dataframe looks as follows:
vote_1 vote_2 vote_3 vote_4
a a a b
b b a b
b a a b
I am tring to generate a fourth column to sum the 'votes' of the other columns and produce the winner, as follows:
vote_1 vote_2 vote_3 vote_4 winner_columns
a a a b a
b b a b b
b a a b draw
I have currently tried:
def winner(x):
a = new_df.iloc[x].value_counts()['a']
b = new_df.iloc[x].value_counts()['b']
if a > b:
y = 'a'
elif a < b:
y = 'b'
else:
y = 'draw'
return y
df['winner_columns'].apply(winner)
However the whole column gets filled with draws. I assume is something with the way I have build the function but can't figure out what
Upvotes: 1
Views: 730
Reputation: 1319
You can use .sum() for counting the votes, then you save in a list the winners, finally you add into dataframe.
numpy_votes = dataframe_votes.to_numpy()
winner_columns = []
for i in numpy_votes:
if np.sum(i == 'a') < np.sum(i == 'b'):
winner_columns.append('b')
elif np.sum(i == 'a') > np.sum(i == 'b'):
winner_columns.append('a')
else:
winner_columns.append('draw')
dataframe_votes['winner_columns'] = winner_columns
Using .sum() method is the fastest way to count elements inside arrays according to this answer.
Output:
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
Upvotes: 0
Reputation: 30971
The first problem is that your DataFrame contains sometimes a letter followed by a dot.
So to look for solely 'a'
or 'b'
you have to replace these dots
with an empty string, something like:
df.replace('\.', '', regex=True)
Another problem, which didin't surface in your case, is that a row can
contain only 'a'
or 'b'
and your code should be resistant to
absence of particular result in such a source row.
To make your function resistant to such cases, change it to:
def winner(row):
vc = row.value_counts()
a = vc.get('a', 0)
b = vc.get('b', 0)
if a > b: return 'a'
elif a < b: return 'b'
else: return 'draw'
Then you can apply your function, but if you want to apply it to each row (not column), you should pass axis=1.
So, to sum up, change your code to:
df['winner_columns'] = df.replace('\.', '', regex=True).apply(winner, axis=1)
The result, for your sample data, is:
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a. a. a. b a
1 b. b. a b b
2 b. a. a b draw
Upvotes: 1
Reputation: 862441
You can use DataFrame.mode
and count non missing values by DataFrame.count
, if only one use first column else draw
in numpy.where
:
df1 = df.mode(axis=1)
print (df1)
0 1
0 a NaN
1 b NaN
2 a b
df['winner_columns'] = np.where(df1.count(axis=1).eq(1), df1[0], 'draw')
print (df)
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
Your solution is possible change:
def winner(x):
s = x.value_counts()
a = s['a']
b = s['b']
if a > b:
y = 'a'
elif a < b:
y = 'b'
else:
y = 'draw'
return y
df['winner_columns'] = df.apply(winner,axis=1)
print (df)
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
Upvotes: 2