Mark Ginsburg
Mark Ginsburg

Reputation: 2269

Dropping selected rows in Pandas with duplicated columns

Suppose I have a dataframe like this:

fname    lname     email

Joe      Aaron   
Joe      Aaron     [email protected]
Bill     Smith 
Bill     Smith
Bill     Smith     [email protected]

Is there a terse and convenient way to drop rows where {fname, lname} is duplicated and email is blank?

Upvotes: 0

Views: 68

Answers (2)

jpp
jpp

Reputation: 164613

You should first check whether your "empty" data is NaN or empty strings. If they are a mixture, you may need to modify the below logic.

If empty rows are NaN

Using pd.DataFrame.sort_values and pd.DataFrame.drop_duplicates:

df = df.sort_values('email')\
       .drop_duplicates(['fname', 'lname'])

If empty rows are strings

If your empty rows are strings, you need to specify ascending=False when sorting:

df = df.sort_values('email', ascending=False)\
       .drop_duplicates(['fname', 'lname'])

Result

print(df)

  fname  lname           email
4  Bill  Smith  [email protected]
1   Joe  Aaron   [email protected]

Upvotes: 1

BENY
BENY

Reputation: 323226

You can using first with groupby (Notice replace empty with np.nan, since the first will return the first not null value for each columns)

df.replace('',np.nan).groupby(['fname','lname']).first().reset_index()
Out[20]: 
  fname  lname           email
0  Bill  Smith  [email protected]
1   Joe  Aaron   [email protected]

Upvotes: 0

Related Questions