no33mis
no33mis

Reputation: 11

Drop list (in list) elements/dataframe rows by column comparison

I am struggling with a problem for a while now. I have a list consisting of other lists which looks like this when printed element by element:

[120, 2, 'germany']
[122, 1, 'germany']
[123, 1, 'france']
[122, 1, 'germany']
[120, 2, 'germany']
[123, 1, 'france']

I would like to drop the sublists which have the same value in the last "column" (index[2]), but only the ones that have a lower value in the middle "column" (index[1]) than the rows with the same last index[2]. No worries, I am aware that this is a list so far and has no actual columns. Hence, the output should like like:

[120, 2, 'germany']
[123, 1, 'france']
[120, 2, 'germany']
[123, 1, 'france']

I have tried multiple approaches so far, from single loops over simultaneous loops up to creating a copy of the list and iterate again. After a while, I have transformed the list into a pandas dataframe (although I will have to transform it back to list for further processing) and tried several methods again, without success so far.

The closest I got is:

dfList = pd.DataFrame(list, columns =['ID',"status","country"])
dfList = dfList.groupby('country').filter(lambda g: len(g) > 1).drop_duplicates(subset=['country', 'status'], keep="first")

However, this will only provide me with each unique value within the dataframe. The difficulty lies in the fact that the code needs to be reusable with other values as well, hence I cannot filter with the actual values.

I would be very thankful for some thoughts or ideas.

Upvotes: 0

Views: 101

Answers (4)

perl
perl

Reputation: 9941

Here's with transform to get max status within each group:

dfList[
    dfList['status'].eq(
        dfList.groupby('country')['status'].transform('max'))]

Output:

    ID  status  country
0  120   2      germany
2  123   1      france 
4  120   2      germany
5  123   1      france 

Upvotes: 1

Sayandip Dutta
Sayandip Dutta

Reputation: 15872

You can try the following:

>>> (dfList.groupby('country', group_keys=False)
           .apply(lambda g: g[g.status == g.status.max()])
           .sort_index())
    ID  status  country
0  120       2  germany
2  123       1   france
4  120       2  germany
5  123       1   france

Upvotes: 1

Arseniy
Arseniy

Reputation: 690

if I understood the question correctly, i can suggest this approach as a basis for your code

rows=[[120, 2, 'germany'],
[122, 1, 'germany'],
[123, 1, 'france'],
[122, 1, 'germany'],
[120, 2, 'germany'],
[123, 1, 'france']]


rd={}
for row in rows:
    if row[2] not in rd:
        rd[row[2]]=row[1]
        print(row)
    elif row[1] >= rd[row[2]]:
        rd[row[2]]=row[1]
        print(row)

The output will be

[120, 2, 'germany']
[123, 1, 'france']
[120, 2, 'germany']
[123, 1, 'france']

just like you wanted

Upvotes: 0

jonathan
jonathan

Reputation: 269

try to use this code

data = [[120, 2, 'germany']
    , [122, 1, 'germany']
    , [123, 1, 'france']
    , [122, 1, 'germany']
    , [120, 2, 'germany']
    , [123, 1, 'france']]

d = {}

for i in data:
    if not d.__contains__(i[2]):
        d[i[2]] = i
    elif d[i[2]][1] < i[1]:
        d[i[2]] = i

for i in d.values():
    print(i)

Upvotes: 0

Related Questions