Reputation: 311
I have a df that looks like:
                 C  E    H
window
(AAA, AAA, AAA)  26  4  111
(AAA, AAA, AAC)   3  1    1
And a dictionary called p_dict. Each value in df['window'] contains three triplets of letters, and each triplet is a key in my p_dict. What I've done so far to achieve what I want is:
import statistics

dim_list = []
for word in df['window']:
    a = p_dict[word[2:5]]    # list of length 100
    b = p_dict[word[9:12]]   # list of length 100
    c = p_dict[word[16:19]]  # list of length 100
    flav = [statistics.mean(k) for k in zip(a, b, c)]
    dim_list.append(flav)
df['dimensions'] = dim_list
But this process is very long for a df with 1mil rows. Is there any other way of doing this?
Edit
p_dict looks like
{'AAA': [0.2, 12, 301, ..], 'AAC': [31, 0.91, 8, ..]}
where each value is an embedding in a 100-dimensional space.
What I want to get:
For each triplet in window, look up its 100 dimensions in the dictionary and average them element-wise to get one averaged list of dimensions.
So for window (AAA, AAA, AAC):
AAA -> p_dict['AAA'] -> [100 dimensions] # list 1
AAA -> p_dict['AAA'] -> [100 dimensions] # list 2
AAC -> p_dict['AAC'] -> [100 dimensions] # list 3
output = element-wise average of lists 1, 2 and 3
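The desired averaging can be sketched with numpy, using toy 3-dimensional vectors in place of the real 100-dimensional embeddings (the structure is identical):

```python
import numpy as np

# toy p_dict with 3-dimensional vectors standing in for the 100-dimensional ones
p_dict = {'AAA': [0.2, 12, 301.], 'AAC': [31, 0.91, 8.]}

# stack the three looked-up vectors and average element-wise
vecs = np.array([p_dict[k] for k in ('AAA', 'AAA', 'AAC')])
output = vecs.mean(axis=0)
```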
Upvotes: 0
Views: 69
Reputation: 150735
You want to split the words in window so that you have an n x 3 dataframe. Then use replace and mean(axis=1):
import pandas as pd

df = pd.DataFrame({'window': ['(AAA, AAA, AAA)', '(AAA, AAA, AAC)'],
                   'C': [26, 3],
                   'E': [4, 1],
                   'H': [111, 1]})
p_dict = {'AAA': 1, 'AAC': 2}

(df['window'].str[1:-1]
   .str.split(r',\s*', expand=True)
   .replace(p_dict)
   .mean(axis=1)
)
gives:
0 1.000000
1 1.333333
dtype: float64
In the case your p_dict is a dict of lists, we only need to tweak a little:
p_dict = {'AAA': [0.2, 12, 301.], 'AAC': [31, 0.91, 8.]}
p_df = pd.DataFrame(p_dict).T

new_df = (df['window'].str[1:-1]
            .str.split(r',\s*', expand=True)
            .stack()
          )

pd.DataFrame(p_df.loc[new_df].values,
             index=new_df.index).mean(level=0)
# on pandas >= 2.0, use .groupby(level=0).mean() instead of .mean(level=0)
gives you:
0 1 2
0 0.200000 12.000000 301.000000
1 10.466667 8.303333 203.333333
Note that this only works when all the lists in the dict have the same length.
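If the lookup-and-stack step is still slow at 1M rows, a variant of the same idea that maps each column through the dict and averages in numpy (a sketch, assuming every triplet in window is a key of p_dict) looks like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'window': ['(AAA, AAA, AAA)', '(AAA, AAA, AAC)']})
p_dict = {'AAA': [0.2, 12, 301.], 'AAC': [31, 0.91, 8.]}

# one column per triplet
triplets = df['window'].str[1:-1].str.split(r',\s*', expand=True)

# look up each column's embeddings, giving an array of shape (3, n_rows, dim)
emb = np.stack([np.array(triplets[c].map(p_dict).to_list())
                for c in triplets.columns])

# average over the three triplets of each window
df['dimensions'] = list(emb.mean(axis=0))
```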
Upvotes: 1