Reputation: 157
Here I have a dataset with transactions. Each transaction could have 1+ different values - 'dimensions'. Values could not be the same per transaction. I want to create a dataframe with 'dimensions' in the columns and in lines, and count how many times one dimension was used together with another per transaction.
Here what I tried
dim_set = [ (1, 'Customer group$Large'),
(1, 'DEPARTMENT$Sales'),
(2, 'Customer group$Medium'),
(2, 'DEPARTMENT$Sales'),
(3, 'DEPARTMENT$Sales'),
(4, 'Customer group$Small'),
(4, 'DEPARTMENT$Sales')
]
df = pd.DataFrame(dim_set, columns=['combination_id', 'dimension'])
df
df_st_1 = df.pivot_table(index = 'dimension', columns = 'dimension',values = 'combination_id', aggfunc = 'count')
df_st_1
an expected result should be like this
dim_set = [ ('Customer group$Large', 1, 1, 0, 0),
('DEPARTMENT$Sales', 1, 4, 1, 1),
('Customer group$Medium', 0, 1, 1, 0),
('Customer group$Small', 0, 1, 0, 1)
]
df = pd.DataFrame(dim_set, columns=['dimension','Customer group$Large', 'DEPARTMENT$Sales', 'Customer group$Medium', 'Customer group$Small'])
df
Upvotes: 2
Views: 165
Reputation: 862511
Use DataFrame.merge
with crosstab
, last some data cleaning by DataFrame.reset_index
and DataFrame.rename_axis
:
df1 = df.merge(df, on='combination_id', suffixes=('','_'))
df1 = (pd.crosstab(df1['dimension'], df1['dimension_'])
.reset_index()
.rename_axis(None)
.rename_axis(None, axis=1))
print (df1)
dimension Customer group$Large Customer group$Medium \
0 Customer group$Large 1 0
1 Customer group$Medium 0 1
2 Customer group$Small 0 0
3 DEPARTMENT$Sales 1 1
Customer group$Small DEPARTMENT$Sales
0 0 1
1 0 1
2 1 1
3 1 4
Upvotes: 1