Python Pandas: Delete duplicate rows based on one column and concatenate information from multiple columns

Question

I have a pandas dataframe that contains duplicates according to one column (ID), but has differing values in several other columns. My goal is to remove the duplicates based on ID, but to concatenate the information from the other columns.

Here is an example of what I'm working with:

ID    Age   Gender   Form          Signature   Level
000   30    M        Paper         Yes         A
000   30    M        Electronic    No          B
001   42             Paper         No          B

After processing, I would like the data to look like this:

ID    Age   Gender   Form                      Signature   Level
000   30    M        Paper, Electronic         Yes, No     A, B
001   42             Paper                     No          B

First, I filled the nAn cells with "Not Noted" so that I can use the groupby function. I tried the following code:

df = df.groupby(['ID', 'Age', 'Gender'])['Form'].apply(set).reset_index()

This takes care of concatenating the Form column, but I cannot figure out how to incorporate the Signature and Level columns as well. Does anyone have any suggestions?

David M. · Accepted Answer

You can do it like this:

import pandas as pd

df = pd.DataFrame({'ID': ['000', '000', '001'],
                   'Age': [30, 30, 42],
                   'Gender': ['M', 'M', ''],
                   'Form': ['Paper', 'Electronic', 'Paper'],
                   'Signature': ['Yes', 'No', 'No'],
                   'Level': ['A', 'B', 'B']})

df = df.groupby(['ID', 'Age', 'Gender']).agg({'Form': set, 'Signature': set, 'Level': set}).reset_index()

print(df)

Output:

    ID  Age Gender                 Form  Signature   Level
0  000  30   M      {Electronic, Paper}  {No, Yes}  {B, A}
1  001  42          {Paper}              {No}       {B}

Python Pandas: Delete duplicate rows based on one column and concatenate information from multiple columns

Answers (2)

Related Questions