Reputation: 4842
I have a df
:
my_data = {'order_id': {('group_1', 'A'): 2,
('group_1', 'B'): 4,
('group_1', 'C'): 6,
('group_1', 'D'): 5,
('group_10', 'A'): 6,
('group_10', 'B'): 4,
('group_10', 'C'): 7,
('group_10', 'D'): 13,
('group_2', 'A'): 3,
('group_2', 'B'): 4,
('group_2', 'C'): 1,
('group_2', 'D'): 4,
('group_3', 'A'): 5,
('group_3', 'B'): 1,
('group_3', 'C'): 3,
('group_3', 'D'): 6,
('group_4', 'A'): 4,
('group_4', 'B'): 8,
('group_4', 'C'): 4,
('group_4', 'D'): 10,
('group_5', 'A'): 3,
('group_5', 'B'): 3,
('group_5', 'C'): 5,
('group_5', 'D'): 8,
('group_6', 'A'): 8,
('group_6', 'B'): 7,
('group_6', 'C'): 2,
('group_6', 'D'): 5,
('group_7', 'A'): 3,
('group_7', 'B'): 5,
('group_7', 'C'): 3,
('group_7', 'D'): 7,
('group_8', 'A'): 7,
('group_8', 'B'): 2,
('group_8', 'C'): 6,
('group_8', 'D'): 4,
('group_9', 'A'): 4,
('group_9', 'B'): 7,
('group_9', 'C'): 6,
('group_9', 'D'): 4}}
While using pd.DataFrame.from_dict()
I lose the multiindex, not sure how I can preserve the multiindex hierarchy.
The df.unstack()
looks like this:
I am trying to answer a question - Is there a correlation between the type of customer group and the type of product bought?
I ended up grouping the data by these two dimensions to get a count of how many each customer_group
bought each product_type
, the df
is the result of that.
Now if I use:
df.corr()
I get:
And the product_type
is gone, and now I am thinking if corr()
is not applicable in this type of question?
My desired result would be an answer how correlated are the order combinations of product_type
and customer_group
. And now taking a second look at the first df
, maybe I can say that the combination that had the most sales group_10
& D
are the most correlated?
Upvotes: 1
Views: 162
Reputation: 30579
This is not the answer to your original question (see my comment above), but it's too long for a comment. Based on relative sales you can answer the following questions:
Which product group is the most popular for each customer group?
pd.Series(df.index[(df.to_numpy().T / df.sum(axis=1).to_numpy()).T.argmax(axis=0)], index=df.columns.get_level_values(1))
Answer:
A group_8
B group_2
C group_1
D group_10
Which customer group buys the most of each product group?
pd.Series(df.columns.get_level_values(1)[(df.to_numpy() / df.sum(axis=0).to_numpy()).argmax(axis=1)], index=df.index)
Answer:
group_1 C
group_10 D
group_2 B
group_3 A
group_4 B
group_5 D
group_6 A
group_7 B
group_8 A
group_9 B
Upvotes: 1