Reputation: 16050
I want to check how many times key1
, key2
and key3
were repeated in df
, and for each key to estimate an average value of num
.
Now I can create keys_summary
for only single key (e.g. key1
). How can I count the frequency of all keys in a single line? Also, how to add the average values of num
into the keys_summary
?
import pandas as pd
s1 = pd.Series(['aaa abc','aaa cba','bbb bbc','aaa cba','bbb bbc'])
s2 = pd.Series([3,5,1,4,0])
df = pd.DataFrame({'descr': s1, 'num': s2})
print df
k1 = pd.Series(['aaa','abc','cba'])
k2 = pd.Series(['bbb','bbc'])
k3 = pd.Series(['ddd','ddc'])
keys = pd.DataFrame({'key1': k1,'key2': k2, 'key3': k3})
print keys
keys_summary = df['descr'].groupby(lambda x : x in keys['key1']).count()
print keys_summary
The expected result must be the following:
keys_summary
count avg_num
key1 3 4
key2 2 1
key3 0 0
Upvotes: 0
Views: 88
Reputation: 4326
Edit: here's an alternative answer that does not rely on building a permutation list, rather it expects that the values for each key be disjoint (that is, no value belongs to more than one string). Given the dataframes df
and keys
:
keys_summary = pd.DataFrame()
for col in keys:
keys_summary[col] = df[df.descr.apply(lambda x: any(word in x.split() for word in keys[col]))].describe().num
keys_summary = keys_summary.transpose()
This results in the same dataframe as given below, for the example in the question.
Given your dataframe of keys
, you can generate lists of permutations 2 by 2 (if descr
is always two key-substrings, that is - else some minor modifications might be necessary) and then check for them in df
. Use the describe
method to get the stats.
import itertools
#df and keys as given
key_dict = {}
for col in keys:
perms = []
for (a,b) in itertools.permutations(keys[col].tolist(), 2):
perms.append(str(a) + ' ' + str(b))
key_dict[col] = perms
Then you use this dictionary and the df
dataframe to generate a keys_summary
:
keys_summary = pd.DataFrame()
for k,v in key_dict.items():
keys_summary[k] = df[df.descr.isin(v)].describe().num
keys_summary = keys_summary.transpose()
This results in:
count mean std min 25% 50% 75% max
key3 0 NaN NaN NaN NaN NaN NaN NaN
key2 2 0.5 0.707107 0 0.25 0.5 0.75 1
key1 3 4.0 1.000000 3 3.50 4.0 4.50 5
You can drop the columns min, 25%, max, etc. if you don't need them.
Upvotes: 2