Reputation: 43
I have a pandas dataframe that contains duplicates according to one column (ID), but has differing values in several other columns. My goal is to remove the duplicates based on ID, but to concatenate the information from the other columns.
Here is an example of what I'm working with:
ID Age Gender Form Signature Level
000 30 M Paper Yes A
000 30 M Electronic No B
001 42 Paper No B
After processing, I would like the data to look like this:
ID Age Gender Form Signature Level
000 30 M Paper, Electronic Yes, No A, B
001 42 Paper No B
First, I filled the nAn cells with "Not Noted" so that I can use the groupby function. I tried the following code:
df = df.groupby(['ID', 'Age', 'Gender'])['Form'].apply(set).reset_index()
This takes care of concatenating the Form column, but I cannot figure out how to incorporate the Signature and Level columns as well. Does anyone have any suggestions?
Upvotes: 2
Views: 1099
Reputation: 4588
You can do it like this:
import pandas as pd
df = pd.DataFrame({'ID': ['000', '000', '001'],
'Age': [30, 30, 42],
'Gender': ['M', 'M', ''],
'Form': ['Paper', 'Electronic', 'Paper'],
'Signature': ['Yes', 'No', 'No'],
'Level': ['A', 'B', 'B']})
df = df.groupby(['ID', 'Age', 'Gender']).agg({'Form': set, 'Signature': set, 'Level': set}).reset_index()
print(df)
Output:
ID Age Gender Form Signature Level
0 000 30 M {Electronic, Paper} {No, Yes} {B, A}
1 001 42 {Paper} {No} {B}
Upvotes: 2
Reputation: 41
You can do this by modifying each column separately and then concatenating them with some basic list comprehension and the pd.concat
function.
g = df.groupby(['ID', 'Age', 'Gender'])
concatCols = ['Form', 'Signature', 'Level'] #? Used to define which columns should be concatenated
df = pd.concat([g[c].apply(set) for c in concatCols], axis=1).reset_index()
print(df)
Upvotes: 3