Laurent R

Reputation: 832

apply mask on np.array in pandas

I have a pd.DataFrame with one column of boolean masks and one column of arrays. I want to apply each row's mask to that row's array (as I would with np.where) to obtain the 'Result' column shown below.

Does anyone have an idea how to do this?

df = pd.DataFrame({'Mask'   : [[True, False, True], [False, False], [True, True]],
                   'Array'  : [[2, 5,4]           , [1, 0]        , [4, 5],],
                   'Result' : [[2, 4]             , []            , [4,5]]})

def ffilter(entry):
    return entry['Array']['Mask']

df.apply(ffilter) #--> Nope too easy :-(

Upvotes: 0

Views: 632

Answers (2)

Stefan Falk

Reputation: 25367

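You could build a boolean array from df.Mask, pass it to the data frame's mask() method, and aggregate. Note that mask() replaces the entries where the condition is True with NaN, so the statistics are computed per column over the values whose mask is False. This also requires all lists to have the same length.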

This would be the "one-liner":

pd.DataFrame(df.Array.tolist())\
    .mask(np.asarray(df.Mask.tolist()))\
    .agg(['mean', 'std', 'min', 'max'])

which gives you:

        0         1
mean  1.0  2.500000
std   NaN  3.535534
min   1.0  0.000000
max   1.0  5.000000

Or as a whole:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Mask'   : [[True, False], [False, False], [True, True]],
                   'Array'  : [[2, 5]       , [1, 0]        , [4, 5],],
                   'Result' : [[2]          , []            , [4, 5]]})

df_Array = pd.DataFrame(df.Array.tolist())
mask = np.asarray(df.Mask.tolist())

df_Array.mask(mask).agg(['mean', 'std', 'min', 'max'])
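
If the intent is to keep the values whose mask is True rather than hide them, where() is the counterpart of mask(). The following is only a sketch on the same equal-length example data; the names values, keep, hidden, kept and stats are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same equal-length example data as above.
df = pd.DataFrame({'Mask':  [[True, False], [False, False], [True, True]],
                   'Array': [[2, 5], [1, 0], [4, 5]]})

values = pd.DataFrame(df['Array'].tolist())  # one column per list position
keep = np.asarray(df['Mask'].tolist())       # boolean array of shape (3, 2)

hidden = values.mask(keep)   # NaN where keep is True (what the one-liner does)
kept = values.where(keep)    # NaN where keep is False (values are kept where True)

# Column-wise statistics over the kept values; NaN entries are skipped.
stats = kept.agg(['mean', 'min', 'max'])
print(stats)
```

Both calls need a rectangular boolean array, so this does not carry over to lists of different lengths.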

From the comments, it is still not clear what your desired output is. I'll assume you want to calculate statistics like min, max and std for each of the arrays in your data frame and, further, to get a data frame where each row represents one of those arrays:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Mask'   : [[True, False, True], [False, False], [True, True]],
                   'Array'  : [[2, 5,4]           , [1, 0]        , [4, 5],],
                   'Result' : [[2, 4]             , []            , [4,5]]})

# axis=1 applies the function row by row
df_stats = df.apply(lambda x: pd.Series(x.Array)[x.Mask]
                    .agg(['min', 'max', 'std', 'mean']), axis=1)

print(df_stats)

which produces:

   min  max       std  mean
0  2.0  4.0  1.414214   3.0
1  NaN  NaN       NaN   NaN
2  4.0  5.0  0.707107   4.5
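
If what you want is the filtered lists themselves (the 'Result' column from the question) rather than statistics, the same row-wise apply can return them. This is a minimal sketch: it reuses the name ffilter from the question, and the 'Filtered' column name is made up here. The original attempt fails because df.apply works column by column unless axis=1 is given, and because a Python list cannot be indexed with another list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Mask':  [[True, False, True], [False, False], [True, True]],
                   'Array': [[2, 5, 4], [1, 0], [4, 5]]})

def ffilter(row):
    # Convert both lists to NumPy arrays so that boolean indexing works.
    values = np.asarray(row['Array'])
    keep = np.asarray(row['Mask'], dtype=bool)
    return values[keep].tolist()

# axis=1 passes one row at a time; the default axis=0 would pass whole columns.
df['Filtered'] = df.apply(ffilter, axis=1)
print(df['Filtered'].tolist())
```

This works with lists of different lengths, since every row is handled on its own.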

Upvotes: 2

Laurent R

Reputation: 832

This does the trick, even if it is not very Pythonic:

arr = df.Array.tolist()
mask = df.Mask.tolist()

# Index each array with its own boolean mask
result = [np.asarray(a)[m] for a, m in zip(arr, mask)]
result

>>> [array([2, 4]), array([], dtype=int64), array([4, 5])]
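
A variant without NumPy, in case plain Python lists are wanted back in the data frame: itertools.compress keeps the items whose selector is truthy. This is only a sketch; the names filtered and 'Filtered' are made up for the example.

```python
from itertools import compress

import pandas as pd

df = pd.DataFrame({'Mask':  [[True, False, True], [False, False], [True, True]],
                   'Array': [[2, 5, 4], [1, 0], [4, 5]]})

# compress(data, selectors) yields the items of data whose selector is truthy.
filtered = pd.Series(
    [list(compress(a, m)) for a, m in zip(df['Array'], df['Mask'])],
    index=df.index,
)

df['Filtered'] = filtered
print(df['Filtered'].tolist())
```

Note that compress stops at the shorter of its two inputs, so a mask whose length does not match its array is not reported as an error.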

Upvotes: 0
