Reputation: 147
Given a dataframe like this:
id  column1  column2  column3  column4  column5
1   a        b        a        b        x
2   b        a        c        b        y
3   c        b        a        c        x
Considering these groups:
group_1 = ['a', 'b', 'c']
group_2 = ['x', 'y', 'z']
How can I get this output:
%_a %_b %_c %_x %_y %_z
0.3333 0.4167 0.25 0.667 0.333 0
because:
% of elements of group_1: a=4/12, b=5/12, c=3/12
% of elements of group_2: x=2/3, y=1/3, z=0
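For reference, the sample frame and the two groups can be reproduced like this (a minimal sketch, assuming all cell values are plain strings):
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "column1": ["a", "b", "c"],
    "column2": ["b", "a", "b"],
    "column3": ["a", "c", "a"],
    "column4": ["b", "b", "c"],
    "column5": ["x", "y", "x"],
})

group_1 = ["a", "b", "c"]
group_2 = ["x", "y", "z"]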
Upvotes: 0
Views: 101
Reputation: 25323
In case elements of group_1 and group_2 cannot coexist in the same column, a possible solution based on numpy is:
import numpy as np

def get_percs(df, group, cols):
    size = df.iloc[:, cols].values.size
    return np.sum(
        np.sum(df.iloc[:, cols].values ==
               np.array(group)[:, None][:, None], axis=2),
        axis=1) / size

# columns 1-4 hold the group_1 values, column 5 holds the group_2 values
get_percs(df, group_1, range(1, 5)), get_percs(df, group_2, range(5, 6))
Output:
(array([0.33333333, 0.41666667, 0.25 ]),
array([0.66666667, 0.33333333, 0. ]))
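To see what the broadcast comparison inside get_percs does, here is a step-by-step sketch for group_1 (assuming the sample frame from the question):
vals = df.iloc[:, 1:5].values             # shape (3, 4): the group_1 columns
grp = np.array(group_1)[:, None, None]    # shape (3, 1, 1): one layer per group element
mask = vals == grp                        # broadcasts to shape (3, 3, 4)
print(mask.sum(axis=(1, 2)))              # [4 5 3] -> occurrences of 'a', 'b', 'c'
print(mask.sum(axis=(1, 2)) / vals.size)  # approx. [0.3333 0.4167 0.25]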
If the output is required to be a dataframe:
pd.concat([pd.DataFrame(get_percs(df, g[0], g[1]).reshape(1, -1),
                        columns=[f'%_{x}' for x in g[0]])
           for g in zip([group_1, group_2], [range(1, 5), range(5, 6)])], axis=1)
Output:
%_a %_b %_c %_x %_y %_z
0 0.333333 0.416667 0.25 0.666667 0.333333 0.0
Upvotes: 1
Reputation: 14064
Here's an approach with df.stack:
group_1 = ['a', 'b', 'c']
group_2 = ['x', 'y', 'z']

s = df.stack()

res = (pd
       .concat(
           [s[s.isin(g)].value_counts(normalize=True)
            .reindex(g).fillna(0)
            for g in [group_1, group_2]]
       )
       .to_frame().T.add_prefix('%_')
       )
res
%_a %_b %_c %_x %_y %_z
0 0.333333 0.416667 0.25 0.666667 0.333333 0.0
Explanation
- df.stack to turn the df into a pd.Series.
- Select only the values that belong to one of the groups (Series.isin), and to this selection apply Series.value_counts with the normalize parameter set to True.
- Series.reindex to re-add any values from the group that are not present in the selection (i.e. z from group_2); we supply the value 0 for any such missing values by using Series.fillna.
- pd.concat to get back one Series.
- Turn the Series into a df (Series.to_frame), transpose it (df.T), and add a prefix to the columns (df.add_prefix).
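As a quick check of the intermediate steps for group_2 (a sketch, assuming the sample frame from the question):
s = df.stack()                 # long Series of every cell value, including 'id'
sel = s[s.isin(group_2)]       # only 'x' and 'y' survive; 'z' never occurs
print(sel.value_counts(normalize=True).to_dict())
# {'x': 0.6666666666666666, 'y': 0.3333333333333333}
print(sel.value_counts(normalize=True).reindex(group_2).fillna(0).to_dict())
# {'x': 0.6666666666666666, 'y': 0.3333333333333333, 'z': 0.0}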
Upvotes: 1
Reputation: 37747
Here is a proposition with melt & value_counts:
g1 = df.isin(group_1).sum().sum()
g2 = df.isin(group_2).sum().sum()

# map every value to the total count of its group (the denominator)
d = {**{k: g1 for k in group_1}, **{k: g2 for k in group_2}}

out = (
    df
    .set_index("id").melt()
    ["value"].value_counts().to_frame()
    .join(pd.Series(d).rename("c"), how="outer").fillna(0)
    .assign(temp_col=lambda x: x.pop("value").div(x.pop("c")))
    .T.sort_index(axis=1).add_prefix("%_").reset_index(drop=True)
)
Output:
print(out)
%_a %_b %_c %_x %_y %_z
0 0.333333 0.416667 0.25 0.666667 0.333333 0.0
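For reference, with the sample frame from the question, the numerator counts and the denominator mapping d look like this (a quick check):
print(df.set_index("id").melt()["value"].value_counts().to_dict())
# {'b': 5, 'a': 4, 'c': 3, 'x': 2, 'y': 1}              -> counts per value
print(d)
# {'a': 12, 'b': 12, 'c': 12, 'x': 3, 'y': 3, 'z': 3}   -> group totals
The outer join then brings in z with a missing count, which fillna(0) turns into 0 before the division.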
Upvotes: 1
Reputation: 161
You can use boolean masks for this. For example:
df.isin(group_1)
will give you a boolean mask where the values of the DataFrame that are in the list group_1 are True and the rest are False.
Then counting the number of True values is as simple as calling the method .sum() twice, or .values.sum() (i.e. converting to a numpy array and then summing the True values).
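For instance (a small check, assuming the sample frame from the question):
mask = df.isin(group_1)     # True wherever a cell is 'a', 'b' or 'c'
print(mask.sum().sum())     # 12 -> .sum() per column, then summed
print(mask.values.sum())    # 12 -> same count via the numpy array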
In the same way, you can iterate through the elements of the group and use a boolean mask to count individual values, e.g.
for i in group_1:
    print(f"Number of {i} = {(df == i).values.sum()}")
For the solution I would put the groups in a dictionary and then iterate through the dictionary:
groups = {'group1': group_1, 'group2': group_2}
results = {}
for group_name, group in groups.items():
    count_group = df.isin(group).values.sum()
    results[group_name] = {}
    for val in group:
        count_value = (df == val).values.sum()
        results[group_name][val] = count_value / count_group
print(results)
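This prints the nested percentages per group. If the single-row %_ layout asked for in the question is needed, a possible conversion of results is (a sketch):
# flatten the nested dict into one row of %_ columns
out = pd.DataFrame([{f"%_{val}": pct
                     for group in results.values()
                     for val, pct in group.items()}])
print(out)   # columns %_a ... %_z with the percentages from the question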
Upvotes: 1