Anna
Anna

Reputation: 43

Python Pandas: Delete duplicate rows based on one column and concatenate information from multiple columns

I have a pandas dataframe that contains duplicates according to one column (ID), but has differing values in several other columns. My goal is to remove the duplicates based on ID, but to concatenate the information from the other columns.

Here is an example of what I'm working with:

ID    Age   Gender   Form          Signature   Level
000   30    M        Paper         Yes         A
000   30    M        Electronic    No          B
001   42             Paper         No          B

After processing, I would like the data to look like this:

ID    Age   Gender   Form                      Signature   Level
000   30    M        Paper, Electronic         Yes, No     A, B
001   42             Paper                     No          B

First, I filled the nAn cells with "Not Noted" so that I can use the groupby function. I tried the following code:

df = df.groupby(['ID', 'Age', 'Gender'])['Form'].apply(set).reset_index()

This takes care of concatenating the Form column, but I cannot figure out how to incorporate the Signature and Level columns as well. Does anyone have any suggestions?

Upvotes: 2

Views: 1099

Answers (2)

David M.
David M.

Reputation: 4588

You can do it like this:

import pandas as pd

df = pd.DataFrame({'ID': ['000', '000', '001'],
                   'Age': [30, 30, 42],
                   'Gender': ['M', 'M', ''],
                   'Form': ['Paper', 'Electronic', 'Paper'],
                   'Signature': ['Yes', 'No', 'No'],
                   'Level': ['A', 'B', 'B']})

df = df.groupby(['ID', 'Age', 'Gender']).agg({'Form': set, 'Signature': set, 'Level': set}).reset_index()

print(df)

Output:

    ID  Age Gender                 Form  Signature   Level
0  000  30   M      {Electronic, Paper}  {No, Yes}  {B, A}
1  001  42          {Paper}              {No}       {B} 

Upvotes: 2

LQR471814
LQR471814

Reputation: 41

You can do this by modifying each column separately and then concatenating them with some basic list comprehension and the pd.concat function.

g = df.groupby(['ID', 'Age', 'Gender'])
concatCols = ['Form', 'Signature', 'Level'] #? Used to define which columns should be concatenated
df = pd.concat([g[c].apply(set) for c in concatCols], axis=1).reset_index()
print(df)

Upvotes: 3

Related Questions