Reputation: 121
Assuming I have two rows where for most of the columns the values are same, but not for all. I would like to group these two rows into one where ever the values are same and if the values are different then create an extra column and assign the column name as 'column1'
Step 1: Here assuming I have columns which has same value in both the rows 'a','b','c' and columns which has different values are 'd','e','f' so I am grouping using 'a','b','c' and then unstacking 'd','e','f'
Step 2: Then I am dropping the levels then renaming it to 'a','b','c','d','d1','e','e1','f','f1'
But in my actual case I have 500+ columns and million rows, I dont know how to expand this to 500+ columns where I have constrains like 1) I dont know which all columns will have same values 2) And which all columns will have different values that needs to be converted into new column after grouping with the columns that has same value
df.groupby(['a','b','c']) ['d','e','f'].apply(lambda x:pd.DataFrame(x.values)).unstack().reset_index()
df.columns = df.columns.droplevel()
df.columns = ['a','b','c','d','d1','e','e1','f','f1']
To be more clear, the below code creates the sample dataframe & expected output
df = pd.DataFrame({'Cust_id':[100,100, 101,101,102,103,104,104], 'gender':['M', 'M', 'F','F','M','F','F','F'], 'Date':['01/01/2019', '02/01/2019','01/01/2019',
'01/01/2019','03/01/2019','04/01/2019','03/01/2019','03/01/2019'],
'Product': ['a','a','b','c','d','d', 'e','e']})
expected_output = pd.DataFrame({'Cust_id':[100, 101,102,103,104], 'gender':['M', 'F','M','F','F'], 'Date':['01/01/2019','01/01/2019','03/01/2019','04/01/2019', '03/01/2019'], 'Date1': ['02/01/2019', 'NA','NA','NA','NA']
, 'Product': ['a', 'b', 'd', 'd','e'], 'Product1':['NA', 'c','NA','NA','NA' ]})
Upvotes: 1
Views: 116
Reputation: 25239
you may do following to get expected_output
from df
s = df.groupby('Cust_id').cumcount().astype(str).replace('0', '')
df1 = df.pivot_table(index=['Cust_id', 'gender'], columns=s, values=['Date', 'Product'], aggfunc='first')
df1.columns = df1.columns.map(''.join)
Out[57]:
Date Date1 Product Product1
Cust_id gender
100 M 01/01/2019 02/01/2019 a a
101 F 01/01/2019 01/01/2019 b c
102 M 03/01/2019 NaN d NaN
103 F 04/01/2019 NaN d NaN
104 F 03/01/2019 03/01/2019 e e
Next, replace columns having duplicated values with NA
df_expected = df1.where(df1.ne(df1.shift(axis=1)), 'NA').reset_index()
Out[72]:
Cust_id gender Date Date1 Product Product1
0 100 M 01/01/2019 02/01/2019 a NA
1 101 F 01/01/2019 NA b c
2 102 M 03/01/2019 NA d NA
3 103 F 04/01/2019 NA d NA
4 104 F 03/01/2019 NA e NA
Upvotes: 1
Reputation: 98
You can try this code - it could be a little cleaner but I think it does the job
df = pd.DataFrame({'a':[100, 100], 'b':['tue', 'tue'], 'c':['yes', 'yes'],
'd':['ok', 'not ok'], 'e':['ok', 'maybe'], 'f':[55, 66]})
df_transformed = pd.DataFrame()
for column in df.columns:
col_vals = df.groupby(column)['b'].count().index.values
for ix, col_val in enumerate(col_vals):
temp_df = pd.DataFrame({column + str(ix) : [col_val]})
df_transformed = pd.concat([df_transformed, temp_df], axis = 1)
Upvotes: 0