Reputation: 51
I have a DataFrame like this:
import pandas as pd

cols = ['a', 'b']
df = pd.DataFrame([[[3, 1, 5, 7], [42, 31]], [[], [44]], [[44, 3, 5, 5, 5, 10], []], [[], [44324, 3]]],
                  columns=cols)
As you can see, there is a list in every cell. For each element I want to do the following: take the mean of the list and add 5; if the result is less than or equal to 0, replace it with 1; if the list is empty, return 0.
My working solution:
df
def convert_list(x):
    if len(x) != 0:
        res = (sum(x) / len(x)) + 5
        if res <= 0:
            res = 1
        return res
    return 0

for col in cols:
    df[col] = df[col].apply(lambda x: convert_list(x))
Desired output:
df
      a        b
0   9.0     41.5
1   0.0     49.0
2  17.0      0.0
3   0.0  22168.5
It works, but it is a very slow solution (the original df has about 50k columns and 100k rows, and the lists might contain many elements). Is there a more efficient solution for this? I also tried converting it to a numpy array and doing some vectorized operations, but the problem is that every list might have a different length, so I can't convert it (unless I pad the other lists with many elements...).
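For illustration, this is roughly the kind of thing that fails (a minimal sketch, not my exact attempt):
import numpy as np

ragged = [[3, 1, 5, 7], [42, 31]]
# Lists of different lengths cannot form a regular 2-D numeric array;
# forcing dtype=object just wraps the Python lists, so vectorized
# operations such as arr.mean(axis=1) are not available.
arr = np.array(ragged, dtype=object)
print(arr.shape, arr.dtype)  # (2,) object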
Upvotes: 3
Views: 124
Reputation: 1388
You could try using explode followed by an aggregation, with the idea of avoiding apply. Something like this, though it is a bit ugly:
def correct_column(col):
    cole = col.explode()
    # an empty list becomes a single NaN row after explode
    m_empty = cole.isna().groupby(cole.index).first()
    groups = cole.groupby(cole.index)
    col_sum = groups.sum()
    col_count = groups.count()
    col = col_sum / col_count + 5
    m_negative = col <= 0
    col[m_empty] = 0
    col[m_negative] = 1
    return col

for col in df.columns:
    df[col] = correct_column(df[col])
Note: groups.sum() / groups.count() may be a strange way to get the mean, but for empty lists (NaN after explode) mean raises an error whereas sum returns 0.
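To see why the empty-list handling works, here is a small sketch on a toy Series (not part of the code above): an empty list turns into a single NaN row when exploded, which is why m_empty is built from isna().
import pandas as pd

s = pd.Series([[1, 2], []])
print(s.explode())
# 0      1
# 0      2
# 1    NaN    <- the empty list at index 1 becomes NaN
# dtype: object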
Upvotes: 1
Reputation: 16147
You could use applymap and np.mean to average every cell and add 5. Then any value below 5 would have been a negative mean, and NaNs can be filled with zero.
import pandas as pd
import numpy as np
cols = ['a', 'b']
df = pd.DataFrame([[[3, 1, 5, 7], [42, 31]], [[], [44]], [[44, 3, 5, 5, 5, 10], []], [[], [44324, 3]]],
                  columns=cols)

df = df.applymap(np.mean) + 5
df[df < 5] = 1
df = df.fillna(0)
Output
      a        b
0   9.0     41.5
1   0.0     49.0
2  17.0      0.0
3   0.0  22168.5
Upvotes: 2