Reputation: 1
Is the drop function wrong? I need expert guidance on how to fix this. I concatenate several CSV files and want to drop the rows that are duplicated on date,time from cs (comparing val2), only keeping the rows from the file whose first val2 value is highest.

Code:
cs = pd.concat([pd.read_csv(f) for f in fnames])
dp = cs[cs.duplicated(['date','time'], keep=False)]
dp = dp.sort_values(['date','time'], ascending=True)
i = 0
while len(dp) > 0:
    if dp.values[i][3] > dp.values[i+1][3]:
        if dp.index[i] > dp.index[i+1]:
            cs.drop(cs[(cs.date == dp.values[i][0]) & (cs.index < dp.index[i])].index, inplace=True)
    dp = cs[cs.duplicated(['date','time'], keep=False)]
    dp = dp.sort_values(['date','time'], ascending=True)
Sample data:
file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230
Expected output:
date,time,val1,val2
20jun,01:00,10,320
20jun,02:00,10,50
21jun,01:00,10,130
21jun,02:00,10,230
Actual output:
date,time,val1,val2
20jun,01:00,10,320
20jun,02:00,10,50
21jun,01:00,10,130
Upvotes: 0
Views: 119
Reputation: 1
The 'invalid' rows got deleted because of the index. After the pd.concat, I needed to reset the index with cs.reset_index(inplace=True, drop=True). Without the reset, the index starts from 0 for each file, so some index labels are repeated, and the drop function deleted every row that carried a repeated label.
Though I can now get the correct result, the filtering by the date column (cs.drop(cs[(cs.date==dp.values[0][0]) ...)) is actually still not working; if it worked, I wouldn't have had to reset the index. Or did I use it wrongly?
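Here is a minimal sketch of the index problem (a hypothetical two-frame example, not my real data):

import pandas as pd

# After concat, both frames contribute the index labels 0 and 1.
a = pd.DataFrame({'val2': [210, 110]})
b = pd.DataFrame({'val2': [320, 50]})
cs = pd.concat([a, b])
print(cs.index.tolist())        # [0, 1, 0, 1] -- labels repeat
cs.drop(index=0, inplace=True)  # drops BOTH rows labelled 0
print(len(cs))                  # 2, not 3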
Thanks all for your help. If you have a better and more elegant way to get the expected output, it would be appreciated.
Best regards.
cs = pd.concat([pd.read_csv(f) for f in fnames])
cs.reset_index(inplace=True, drop=True)
dp = cs[cs.duplicated(['date','time'], keep=False)]
dp = dp.sort_values(['date','time'], ascending=True)
while len(dp) > 0:
    if dp.values[0][3] > dp.values[1][3]:
        if dp.index[0] > dp.index[1]:
            cs.drop(cs[(cs.date == dp.values[0][0]) & (cs.index < dp.index[0])].index, inplace=True)
    dp.drop(dp[(dp.date == dp.values[0][0])].index, inplace=True)
Upvotes: 0
Reputation: 1838
EDITED ANSWER:
After discussion in the comments I think this is what you need (I added some code for reproducing the problem):
import pandas as pd
from io import StringIO
input_string = """file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230"""
buf = StringIO(input_string)
cs = pd.read_csv(buf)
def pick_file(df):
    # first row of each file within this date group
    first = df.groupby('file').first()
    # the file whose first val2 is highest
    file = first['val2'].idxmax()
    return df[df['file'] == file]
result = cs.groupby(['date']).apply(pick_file)
result = result.reset_index(level=0, drop=True)
The result is:
  file   date   time  val1  val2
2   f2  20jun  01:00    10   320
3   f2  20jun  02:00    10    50
4   f2  21jun  01:00    10   130
5   f2  21jun  02:00    10   230
This does some inception: a groupby inside a groupby.
The outer groupby groups by date, because these are the groups we want to search for the file in.
The inner groupby searches for the correct file within a group and keeps only the rows from that file.
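If you prefer to avoid apply, a merge-based sketch along the same lines should give the same result (the first_rows and winners names are just illustrative; this assumes the same cs DataFrame as above):

# One row per (date, file): the first row of each file for that date.
first_rows = cs.groupby(['date', 'file'], as_index=False).first()
# For each date, keep the (date, file) pair whose first val2 is highest.
winners = first_rows.loc[first_rows.groupby('date')['val2'].idxmax(), ['date', 'file']]
# Keep only the rows of cs that belong to the winning file of their date.
result = cs.merge(winners, on=['date', 'file'])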
ORIGINAL ANSWER:
Instead of constructing a dataframe with duplicates and iterating through it, you can just use groupby:
cs = pd.concat([pd.read_csv(f) for f in fnames])
result = cs.groupby(['date', 'time'])\
           .apply(lambda x: x[x['val2'] == x['val2'].max()])
It groups all the rows which have the same values in the date and time columns and then, for each group, it keeps only the row with the highest value of val2.
The result is:
                 file   date   time  val1  val2
date  time
20jun 01:00 2    f2  20jun  01:00    10   320
      02:00 1    f1  20jun  02:00    10   110
21jun 01:00 4    f2  21jun  01:00    10   130
      02:00 5    f2  21jun  02:00    10   230
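A shorter equivalent sketch uses idxmax instead of apply (this assumes the maximum val2 in each group is unique and that cs has a unique index, e.g. from ignore_index=True or reset_index):

# For each (date, time) group, look up the index of the highest val2
# and select those rows directly.
result = cs.loc[cs.groupby(['date', 'time'])['val2'].idxmax()]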
Upvotes: 2
Reputation: 31011
To delete the rows you want, use:
cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)
There is no need for an "initial sort" of the source rows.
As you wrote that from each set of repeating rows you want to keep the row with the highest val2:
First, add ignore_index=True to pd.concat. This way you get an "ordered" index, needed to restore the initial row order in the last step.
Then sort the rows:
cs.sort_values(['date','time','val2'], inplace=True)
so that in each repeating group (by date and time), the row with the highest val2 comes last.
The third step is:
cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)
as in my first proposal.
And the final step: to restore the original order of the rows, sort them once more, this time by the index (in place):
cs.sort_index(inplace=True)
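Putting the four steps together (a sketch assuming the same fnames list as in the question):

import pandas as pd

cs = pd.concat([pd.read_csv(f) for f in fnames], ignore_index=True)  # ordered index
cs.sort_values(['date', 'time', 'val2'], inplace=True)  # highest val2 last in each group
cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)
cs.sort_index(inplace=True)  # restore the original row order

Note that, like the groupby answer above, this keeps the single highest-val2 row per (date, time) pair rather than all rows from one file per date.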
Upvotes: 1