Reputation: 799
I have a pandas DataFrame that contains some duplicate values (not fully duplicated rows). I want to use groupby.apply
to remove the duplication. An example follows.
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
   A  B  C
0  a  1  1
1  a  1  2
2  b  1  1
# My function
def get_uniq_t(df):
    if df.shape[0] > 1:
        df['D'] = df.C * 10 + df.B
        df = df[df.D == df.D.max()].drop(columns='D')
    return df

df = df.groupby('A').apply(get_uniq_t)
Then I get the following ValueError. The issue seems to be related to creating the new column D: if I create column D outside the function, the code seems to run fine. Can someone explain what causes the error?
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)
Upvotes: 1
Views: 98
Reputation: 30981
The problem with your code is that it attempts to modify the original group in place (by adding column D to it).
Another problem is that, to keep one row per group, the function only needs to return a single row, not a DataFrame.
Change your function to:
def get_uniq_t(df):
    iMax = (df.C * 10 + df.B).idxmax()
    return df.loc[iMax]
Then its application returns:
   A  B  C
A
a  a  1  2
b  b  1  1
In my opinion, you should not modify the original group, as doing so would indirectly modify the original DataFrame.
At the very least, pandas may display a warning about it, and it is considered bad practice. Search the web for SettingWithCopyWarning for a more extensive description.
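If you prefer to keep the structure of the original function, one possible workaround is to build the helper column on a new object instead of on the group itself. The following is only a sketch: the function name keep_max_rows is made up for illustration, and selecting the columns ['B', 'C'] explicitly is an assumption made so that the grouping column is handled the same way across pandas versions.

```python
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])

def keep_max_rows(group):
    # assign() returns a new DataFrame, so the group handed in by groupby is not modified.
    scored = group.assign(D=group.C * 10 + group.B)
    # Keep the row(s) with the highest score, then drop the helper column again.
    return scored[scored.D == scored.D.max()].drop(columns='D')

# Explicit column selection keeps the grouping column 'A' out of each group;
# it ends up in the index of the result instead.
result = df.groupby('A')[['B', 'C']].apply(keep_max_rows)
print(result)
```

Unlike the idxmax version, this keeps every row that ties for the maximum within a group, which matches the behaviour of the function in the question.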
My code (the get_uniq_t function) does not modify the original group. It only returns one row from the current group.
The returned row is the one with the greatest value of df.C * 10 + df.B. So when you apply this function, the result is a new
DataFrame whose consecutive rows are the results of this function
for consecutive groups.
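The same row selection can also be sketched without apply at all, by computing the score once and asking for the index label of its maximum in each group. The names score, winners and result are illustrative only. Note that idxmax returns just the first label when several rows tie for the maximum.

```python
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])

# Score every row once; this creates a new Series and leaves df unchanged.
score = df.C * 10 + df.B

# For each value of A, find the index label of the row with the highest score.
winners = score.groupby(df.A).idxmax()

# Select those rows from the original DataFrame; the original row labels are kept.
result = df.loc[winners]
print(result)
```

Here the result keeps the original columns A, B and C and the original row labels, rather than being indexed by A.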
You can achieve the equivalent of a modification by creating new content, e.g. the result of a groupby operation, and then saving it under the same variable that previously held the source DataFrame.
Upvotes: 2