spx

Reputation: 1

pandas drop on duplicates is deleting the wrong rows

Is the drop function wrong? I need expert guidance on how to fix this:

Code:

import pandas as pd

# combine all input files into one frame
cs = pd.concat([pd.read_csv(f) for f in fnames])

# collect every row whose (date, time) pair occurs more than once
dp = cs[cs.duplicated(['date','time'], keep=False)]
dp = dp.sort_values(['date','time'], ascending=True)

i = 0
while len(dp) > 0:
    if dp.values[i][3] > dp.values[i+1][3]:
        if dp.index[i] > dp.index[i+1]:
            cs.drop(cs[(cs.date == dp.values[i][0]) & (cs.index < dp.index[i])].index, inplace=True)
            dp = cs[cs.duplicated(['date','time'], keep=False)]
            dp = dp.sort_values(['date','time'], ascending=True)

Sample data:

file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230

Expected output:

date,time,val1,val2
20jun,01:00,10,320
20jun,02:00,10,50
21jun,01:00,10,130
21jun,02:00,10,230

Actual output:

date,time,val1,val2
20jun,01:00,10,320
20jun,02:00,10,50
21jun,01:00,10,130

Upvotes: 0

Views: 119

Answers (3)

spx

Reputation: 1

The wrong rows got deleted because of the index. After the pd.concat, the index has to be reset with cs.reset_index(inplace=True, drop=True). Without the reset, the index restarts from 0 for each file, so some index values appear more than once, and drop deletes every row that carries a matching index label.
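
To see the pitfall in isolation (a minimal sketch with made-up frames, not the original files):

import pandas as pd

# two toy frames standing in for two input files
f1 = pd.DataFrame({'val2': [210, 110]})   # index 0, 1
f2 = pd.DataFrame({'val2': [320, 50]})    # index 0, 1 again

cs = pd.concat([f1, f2])
print(cs.index.tolist())   # [0, 1, 0, 1] -- labels repeat

# drop(0) removes BOTH rows labelled 0, one from each file
print(len(cs.drop(0)))     # 2

# after reset_index the labels are unique, so drop(0) removes one row
cs = pd.concat([f1, f2]).reset_index(drop=True)
print(len(cs.drop(0)))     # 3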

Although I now get the correct result, the filtering on the date column still does not seem to work (cs.drop(cs[(cs.date==dp.values[0][0]) ...). Had it worked, I wouldn't have needed to reset the index. Or did I use it wrongly?

Thanks all for your help. If you have a better, more elegant way to get the expected output, that would be appreciated.

Best regards.

import pandas as pd

cs = pd.concat([pd.read_csv(f) for f in fnames])
cs.reset_index(inplace=True, drop=True)   # make the index unique across files
dp = cs[cs.duplicated(['date','time'], keep=False)]
dp = dp.sort_values(['date','time'], ascending=True)

while len(dp) > 0:
    # the first two rows of dp share the same (date, time) but come from different files
    if dp.values[0][3] > dp.values[1][3]:
        if dp.index[0] > dp.index[1]:
            # drop this date's earlier-indexed rows from cs ...
            cs.drop(cs[(cs.date == dp.values[0][0]) & (cs.index < dp.index[0])].index, inplace=True)
            # ... and remove the processed date from dp
            dp.drop(dp[(dp.date == dp.values[0][0])].index, inplace=True)

Upvotes: 0

michcio1234

Reputation: 1838

EDITED ANSWER:

After discussion in the comments I think this is what you need (I added some code for reproducing the problem):

import pandas as pd
from io import StringIO

input_string = """file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230"""

buf = StringIO(input_string)
cs = pd.read_csv(buf)

def pick_file(df):
    # first row of each file within this date group
    first = df.groupby('file').first()
    # the file whose first row has the highest val2
    file = first['val2'].idxmax()
    # keep only the rows belonging to that file
    return df[df['file'] == file]

# one group per date; pick_file keeps the winning file's rows
result = cs.groupby(['date']).apply(pick_file)

# drop the extra index level added by the groupby
result = result.reset_index(level=0, drop=True)

The result is:

  file   date   time  val1  val2
2   f2  20jun  01:00    10   320
3   f2  20jun  02:00    10    50
4   f2  21jun  01:00    10   130
5   f2  21jun  02:00    10   230

This does some inception: a groupby inside a groupby.

The outer groupby groups by date, because these are the groups within which we want to search for the right file.

The inner groupby finds the correct file within each group, and only the rows from that file are kept.
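
The question's expected output does not include the file column; if it should be removed as well, dropping it afterwards would work:

result = result.drop(columns='file')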


ORIGINAL ANSWER:

Instead of constructing a dataframe with duplicates and iterating through it, you can just use groupby:

cs = pd.concat([pd.read_csv(f) for f in fnames])
result = cs.groupby(['date', 'time'])\
    .apply(lambda x: x[x['val2']==x['val2'].max()])

It groups all the rows which have the same values in the date and time columns and then, for each group, keeps only the row with the highest value of val2.

The result is:

              file   date   time  val1  val2
date  time                                  
20jun 01:00 2   f2  20jun  01:00    10   320
      02:00 1   f1  20jun  02:00    10   110
21jun 01:00 4   f2  21jun  01:00    10   130
      02:00 5   f2  21jun  02:00    10   230

Upvotes: 2

Valdi_Bo

Reputation: 31011

To delete the rows you want:

  • with repeating values of date and time,
  • keeping the last row from the repeated set,

use:

cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)

There is no need for an initial sort of the source rows.

Edit

As you wrote that from each set of repeating rows you want to keep the row with the highest val2:

First, add ignore_index=True to pd.concat. This way you get a continuous index, which is needed to restore the initial row order (in the last step).

Then sort the rows (note the inplace=True, without which the sort has no effect on cs):

cs.sort_values(['date','time','val2'], inplace=True)

so that within each group repeated by date and time, the row with the highest val2 comes last.

The third step is:

cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)

as in my first proposal.

As the final step, to restore the original order of the rows, sort them once more, this time by the index (in place):

cs.sort_index(inplace=True)
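
Putting the four steps together (a minimal sketch; fnames is the list of input files from the question):

import pandas as pd

# 1. concatenate with a continuous index so the original row order can be restored
cs = pd.concat([pd.read_csv(f) for f in fnames], ignore_index=True)

# 2. within each (date, time) group, put the row with the highest val2 last
cs.sort_values(['date','time','val2'], inplace=True)

# 3. keep only that last (highest-val2) row per (date, time)
cs.drop_duplicates(subset=['date','time'], keep='last', inplace=True)

# 4. restore the original row order
cs.sort_index(inplace=True)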

Upvotes: 1
