Jonas Palačionis
Jonas Palačionis

Reputation: 4842

Correlation between the count of combinations

I have a df:

my_data = {'order_id': {('group_1', 'A'): 2,
  ('group_1', 'B'): 4,
  ('group_1', 'C'): 6,
  ('group_1', 'D'): 5,
  ('group_10', 'A'): 6,
  ('group_10', 'B'): 4,
  ('group_10', 'C'): 7,
  ('group_10', 'D'): 13,
  ('group_2', 'A'): 3,
  ('group_2', 'B'): 4,
  ('group_2', 'C'): 1,
  ('group_2', 'D'): 4,
  ('group_3', 'A'): 5,
  ('group_3', 'B'): 1,
  ('group_3', 'C'): 3,
  ('group_3', 'D'): 6,
  ('group_4', 'A'): 4,
  ('group_4', 'B'): 8,
  ('group_4', 'C'): 4,
  ('group_4', 'D'): 10,
  ('group_5', 'A'): 3,
  ('group_5', 'B'): 3,
  ('group_5', 'C'): 5,
  ('group_5', 'D'): 8,
  ('group_6', 'A'): 8,
  ('group_6', 'B'): 7,
  ('group_6', 'C'): 2,
  ('group_6', 'D'): 5,
  ('group_7', 'A'): 3,
  ('group_7', 'B'): 5,
  ('group_7', 'C'): 3,
  ('group_7', 'D'): 7,
  ('group_8', 'A'): 7,
  ('group_8', 'B'): 2,
  ('group_8', 'C'): 6,
  ('group_8', 'D'): 4,
  ('group_9', 'A'): 4,
  ('group_9', 'B'): 7,
  ('group_9', 'C'): 6,
  ('group_9', 'D'): 4}}

While using pd.DataFrame.from_dict() I lose the multiindex, not sure how I can preserve the multiindex hierarchy.

The df.unstack() looks like this:

enter image description here

I am trying to answer a question - Is there a correlation between the type of customer group and the type of product bought?

I ended up grouping the data by these two dimensions to get a count of how many each customer_group bought each product_type, the df is the result of that.

Now if I use:

df.corr() I get:

enter image description here

And the product_type is gone, and now I am thinking if corr() is not applicable in this type of question?

My desired result would be an answer how correlated are the order combinations of product_type and customer_group. And now taking a second look at the first df, maybe I can say that the combination that had the most sales group_10 & D are the most correlated?

Upvotes: 1

Views: 162

Answers (1)

Stef
Stef

Reputation: 30579

This is not the answer to your original question (see my comment above), but it's too long for a comment. Based on relative sales you can answer the following questions:

Which product group is the most popular for each customer group?

pd.Series(df.index[(df.to_numpy().T / df.sum(axis=1).to_numpy()).T.argmax(axis=0)], index=df.columns.get_level_values(1))

Answer:

A     group_8
B     group_2
C     group_1
D    group_10

Which customer group buys the most of each product group?

pd.Series(df.columns.get_level_values(1)[(df.to_numpy() / df.sum(axis=0).to_numpy()).argmax(axis=1)], index=df.index)

Answer:

group_1     C
group_10    D
group_2     B
group_3     A
group_4     B
group_5     D
group_6     A
group_7     B
group_8     A
group_9     B

Upvotes: 1

Related Questions