Haj Sai

Reputation: 311

Pandas: calculating mean value for rows

I have a df that looks like:

                     C   E    H
     window
(AAA, AAA, AAA)      26  4  111
(AAA, AAA, AAC)       3  1    1

I also have a dictionary called p_dict. Each value in df['window'] contains three groups of letters, and each group is a key in p_dict. This is what I have done so far to get what I want:

import statistics

dim_list = []
for word in df['window']:
    # Slice the three keys out of the window string by position
    a = p_dict[word[2:5]]    # len of 100
    b = p_dict[word[9:12]]   # len of 100
    c = p_dict[word[16:19]]  # len of 100

    # Element-wise mean of the three embeddings
    flav = [statistics.mean(k) for k in zip(a, b, c)]
    dim_list.append(flav)

df['dimensions'] = dim_list

But this is very slow for a df with 1 million rows. Is there a faster way of doing this?

Edit: p_dict looks like {'AAA':[0.2, 12, 301..], 'AAC':[31, 0.91, 8..]}, where each value is an embedding in a 100-dimensional space.

What I want to get: for each triplet in window, look up the three 100-dimensional lists in the dictionary and average them element-wise into a single list of 100 values. So for window (AAA, AAA, AAC):

AAA -> p_dict['AAA'] -> [100 dimensions] # list 1
AAA -> p_dict['AAA'] -> [100 dimensions] # list 2
AAC -> p_dict['AAC'] -> [100 dimensions] # list 3
output = element-wise average of lists 1, 2 and 3
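
As a sketch of that target for a single window, using toy 3-dimensional embeddings in place of the real 100-dimensional ones (the values and the names window, vectors and output are illustrative assumptions):

```python
import numpy as np

# Toy 3-dimensional embeddings; the real ones have 100 dimensions.
p_dict = {'AAA': [0.2, 12.0, 301.0], 'AAC': [31.0, 0.91, 8.0]}

window = ('AAA', 'AAA', 'AAC')

# One row per key in the window: shape (3, n_dims)
vectors = np.array([p_dict[key] for key in window])

# Average down the rows to get one value per dimension
output = vectors.mean(axis=0)
print(output)
```

Each entry of output is the mean of the three values in the same position, so the result has as many entries as one embedding.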

Upvotes: 0

Views: 69

Answers (1)

Quang Hoang

Reputation: 150735

You want to split the words in window into separate columns so that you have an n x 3 DataFrame. Then use replace to map each key to its value and mean(axis=1) to average across each row:

df = pd.DataFrame({'window': ['(AAA, AAA, AAA)', '(AAA, AAA, AAC)'],
 'C': [26, 3],
 'E': [4, 1],
 'H': [111, 1]})

p_dict = {'AAA':1, 'AAC':2}

(df['window'].str[1:-1]
             .str.split(r',\s*', expand=True)
             .replace(p_dict).mean(axis=1)
)

gives:

0    1.000000
1    1.333333
dtype: float64

If your p_dict is a dict of lists, we only need a small tweak:

p_dict = {'AAA':[0.2, 12, 301.], 'AAC':[31, 0.91, 8.]} 
p_df = pd.DataFrame(p_dict).T

new_df = (df['window'].str[1:-1]
             .str.split(r',\s*', expand=True)
             .stack()
         )

# mean(level=0) was removed in pandas 2.0; group on the first index level instead
(pd.DataFrame(p_df.loc[new_df].values,
              index=new_df.index)
   .groupby(level=0).mean())

gives you:

           0          1           2
0   0.200000  12.000000  301.000000
1  10.466667   8.303333  203.333333

Note that this only works if all the lists in the dict have the same length.
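
For a large frame, the same lookup can also be done with NumPy integer indexing rather than stack and a group-wise mean. This is only a sketch: it assumes every key in window exists in p_dict and all lists have the same length, and the names emb, row_of, codes and means are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'window': ['(AAA, AAA, AAA)', '(AAA, AAA, AAC)']})
p_dict = {'AAA': [0.2, 12.0, 301.0], 'AAC': [31.0, 0.91, 8.0]}

# Stack the embeddings into one (n_keys, n_dims) array and
# remember which row each key lives in.
keys = list(p_dict)
emb = np.array([p_dict[k] for k in keys], dtype=float)
row_of = {k: i for i, k in enumerate(keys)}

# Split each window into its three keys: an (n_rows, 3) frame of strings.
parts = df['window'].str[1:-1].str.split(r',\s*', expand=True)

# Translate keys into integer row numbers, one column at a time.
# Assumes no key is missing; a missing key would become NaN here.
codes = np.column_stack(
    [parts[c].map(row_of).to_numpy() for c in parts.columns]
)

# emb[codes] has shape (n_rows, 3, n_dims); average over the three keys.
means = emb[codes].mean(axis=1)

# Store one list per row, as in the question.
df['dimensions'] = means.tolist()
```

The intermediate emb[codes] array holds n_rows x 3 x n_dims floats, so for 1 million rows and 100 dimensions it may be worth processing the frame in chunks.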

Upvotes: 1
