Lei Hao

Reputation: 799

Error in using Pandas groupby.apply to drop duplication

I have a pandas DataFrame in which some values, but not entire rows, are duplicated. I want to use groupby.apply to remove the duplicates. An example follows.

import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
   A  B  C
0  a  1  1
1  a  1  2
2  b  1  1

# My function
def get_uniq_t(df):
    if df.shape[0] > 1:
        df['D'] = df.C * 10 + df.B
        df = df[df.D == df.D.max()].drop(columns='D')
    return df

df = df.groupby('A').apply(get_uniq_t)

Then I get the following ValueError. The issue seems to be related to creating the new column D: if I create column D outside the function, the code seems to run fine. Can someone explain what causes this error?

ValueError: Shape of passed values is (3, 3), indices imply (2, 3)

Upvotes: 1

Views: 98

Answers (1)

Valdi_Bo

Reputation: 30981

The problem with your code is that it attempts to modify the original group.

Another problem is that, to keep one row per group, this function should return a single row rather than a DataFrame.

Change your function to:

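def get_uniq_t(df):
    # Index label of the row with the largest value of C * 10 + B in this group
    iMax = (df.C * 10 + df.B).idxmax()
    # Return that single row without modifying the group
    return df.loc[iMax]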

Applying it with df.groupby('A').apply(get_uniq_t) then returns:

   A  B  C
A         
a  a  1  2
b  b  1  1
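
For reference, here is a minimal end-to-end sketch of the function above applied to the question's sample data. It makes one assumption that is not in the original answer: the columns B and C are selected explicitly before apply, because how the grouping column A is passed to the function depends on the pandas version. As a result, A appears only once (restored by reset_index) instead of both as index and as column.

```python
import pandas as pd

df = pd.DataFrame(
    [["a", 1, 1], ["a", 1, 2], ["b", 1, 1]],
    columns=["A", "B", "C"],
)

def get_uniq_t(group):
    # Label of the row with the largest C * 10 + B within this group
    i_max = (group["C"] * 10 + group["B"]).idxmax()
    # Return one row (a Series); the group itself is left untouched
    return group.loc[i_max]

# Selecting B and C explicitly is an assumption made here so that the
# result does not depend on version-specific handling of column A.
result = df.groupby("A")[["B", "C"]].apply(get_uniq_t).reset_index()
print(result)
```

Because the function only reads from the group, the source DataFrame still has all three rows afterwards.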

Edit following the comment

In my opinion, you should not modify the original group, because doing so would indirectly modify the original DataFrame.

At the very least, pandas displays a warning about this, and it is considered bad practice. Search the web for SettingWithCopyWarning for a more extensive description.
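
If you prefer to keep the shape of the question's function, one way to avoid touching the group is to build the helper column on a new frame with assign. This is only a sketch: the name keep_max_rows is hypothetical, and selecting B and C explicitly is an assumption made so that the example does not depend on how a given pandas version handles the grouping column. Unlike idxmax, this variant keeps every row tied for the maximum.

```python
import pandas as pd

df = pd.DataFrame(
    [["a", 1, 1], ["a", 1, 2], ["b", 1, 1]],
    columns=["A", "B", "C"],
)

def keep_max_rows(group):
    # Hypothetical helper: same rule as in the question, but column D is
    # created on the new frame returned by assign(), so the group passed
    # in by groupby is never modified.
    if len(group) > 1:
        scored = group.assign(D=group["C"] * 10 + group["B"])
        group = scored[scored["D"] == scored["D"].max()].drop(columns="D")
    return group

result = (
    df.groupby("A", group_keys=True)[["B", "C"]]
    .apply(keep_max_rows)
    .reset_index(level="A")  # move the group key back into a column
)
print(result)
```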

My code (get_uniq_t function) does not modify the original group. It only returns one row from the current group.

The returned row is the one with the greatest value of df.C * 10 + df.B within the group. So when you apply this function, the result is a new DataFrame whose consecutive rows are the results of this function for consecutive groups.

You can achieve the equivalent of a modification by creating new content, e.g. the result of a groupby operation, and then saving it under the same variable that previously held the source DataFrame.
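
As a rough illustration of that last point, the sketch below builds the deduplicated frame as new content and then rebinds the name df to it. It uses a per-group idxmax on the computed key instead of apply, which is a swapped-in variant of the same selection rule rather than the code from the answer; the names source, key and best are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [["a", 1, 1], ["a", 1, 2], ["b", 1, 1]],
    columns=["A", "B", "C"],
)
source = df  # second reference, only to show the source frame is untouched

# Ranking key from the question, computed without adding a column to df
key = df["C"] * 10 + df["B"]

# For each value of A, the index label of the row with the largest key
best = key.groupby(df["A"]).idxmax()

# New content saved under the same variable name
df = df.loc[best].reset_index(drop=True)
print(df)
```

After the last assignment, df refers to the two-row result, while source still holds the original three rows.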

Upvotes: 2
