Reputation: 51
I have a DataFrame like this:
import pandas as pd

cols = ['a', 'b']
df = pd.DataFrame([[[3, 1, 5, 7], [42, 31]], [[], [44]], [[44, 3, 5, 5, 5, 10], []], [[], [44324, 3]]],
                  columns=cols)
As you can see, there is a list in every cell. For each element I want to do the following: take the mean of the list and add 5; if the result is less than or equal to 0, replace it with 1; if the list is empty, return 0.
My working solution:
df
def convert_list(x):
    if len(x) != 0:
        res = (sum(x) / len(x)) + 5
        if res <= 0:
            res = 1
        return res
    return 0

for col in cols:
    df[col] = df[col].apply(lambda x: convert_list(x))
Desired output:
df
      a        b
0   9.0     41.5
1   0.0     49.0
2  17.0      0.0
3   0.0  22168.5
It works, but it is a very slow solution (the original df has about 50k columns and 100k rows, and the lists might contain many elements). Is there a more efficient solution for this? I also tried converting it to a numpy array and doing some vectorized operations, but the problem is that every list might have a different length, so I can't convert it (unless I pad the other lists with many elements...).
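For illustration, this is roughly the kind of thing that fails (a minimal sketch, not my exact attempt):
import numpy as np

ragged = [[3, 1, 5, 7], [42, 31]]
# Lists of different lengths cannot form a regular 2-D numeric array;
# forcing dtype=object just wraps the Python lists, so vectorized
# operations such as arr.mean(axis=1) are not available.
arr = np.array(ragged, dtype=object)
print(arr.shape, arr.dtype)  # (2,) object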
Upvotes: 3
Views: 124
Reputation: 1388
You could try using explode followed by an aggregation, with the idea of avoiding apply. Something like this, though it is a bit ugly:
def correct_column(col):
    cole = col.explode()
    # an empty list becomes a single NaN row after explode
    m_empty = cole.isna().groupby(cole.index).first()
    groups = cole.groupby(cole.index)
    col_sum = groups.sum()
    col_count = groups.count()
    col = col_sum / col_count + 5
    m_negative = col <= 0
    col[m_empty] = 0
    col[m_negative] = 1
    return col

for col in df.columns:
    df[col] = correct_column(df[col])
Note: groups.sum() / groups.count() may be a strange way to get the mean, but for empty lists (NaN after explode) mean raises an error whereas sum returns 0.
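To see why the empty-list handling works, here is a small sketch on a toy Series (not part of the code above): an empty list turns into a single NaN row when exploded, which is why m_empty is built from isna().
import pandas as pd

s = pd.Series([[1, 2], []])
print(s.explode())
# 0      1
# 0      2
# 1    NaN    <- the empty list at index 1 becomes NaN
# dtype: object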
Upvotes: 1
Reputation: 16147
You could use applymap and np.mean to average every cell and add 5. Then any value below 5 would have been a negative mean, and NaNs can be filled with zero.
import pandas as pd
import numpy as np
cols = ['a', 'b']
df = pd.DataFrame([[[3, 1, 5, 7], [42, 31]], [[], [44]], [[44, 3, 5, 5, 5, 10], []], [[], [44324, 3]]],
                  columns=cols)

df = df.applymap(np.mean) + 5
df[df < 5] = 1
df = df.fillna(0)
Output
      a        b
0   9.0     41.5
1   0.0     49.0
2  17.0      0.0
3   0.0  22168.5
Upvotes: 2