TuoCuggino

Reputation: 385

Pandas Dataframe - filter data to get unique maximum and minimum rows

I have a dataframe which contains these 4 numeric columns: ['ID', 'A', 'B', 'C']

I want to filter the data to obtain a dataframe where, for each unique value in column ID, I get the non-duplicated rows that correspond to the maximum and minimum values of columns A, B, and C.

The image below shows the input dataframe and the desired output dataframe.

I also included df2 to highlight in blue the rows that differ from a simple max/min search: some of those rows are duplicated, and so should be replaced with the row holding the second/third/... maximum/minimum.

For example, the third row of df2 is replaced with the row containing the second maximum of column B (63), which is the third row of df1. Likewise, the fourth row of df2 is replaced with the fourth row of df1, because it contains the second minimum of column B (-75).

Moreover:

  1. The number of columns can change: in a larger problem I could have more columns than just 'A', 'B', and 'C'.

  2. The number of rows for each ID can change.

  3. The total number of rows in df3 should be UniqueID * Columns * 2.

Currently I'm only able to get df2, using idxmax() / idxmin() and then reindexing the dataframe:

import pandas as pd

df1 = pd.DataFrame({'ID': pd.Series([1., 1., 1., 1., 2., 2., 2., 2., 2., 2., 2.]),
                    'A': pd.Series([100., -97., -56., 69., 150., -120., 30., 92., 35., -41., -75.]),
                    'B': pd.Series([99., -96., 63., -75., 140., -110., 91., -62., 76., 10., 2.]),
                    'C': pd.Series([98., -95., -45., 39., 130., -100., 90., -50., 70., -17., 33.])})



# avoid shadowing the built-ins max/min; as_matrix() was removed, use to_numpy()
idx_max = df1.groupby('ID')[['A', 'B', 'C']].idxmax().to_numpy()
idx_min = df1.groupby('ID')[['A', 'B', 'C']].idxmin().to_numpy()

index = []
for i in range(len(idx_max)):
    for j in range(len(idx_max[0])):
        index.append(idx_max[i][j])
        index.append(idx_min[i][j])

df2 = df1.reindex(index)
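For reference, a quick diagnostic on the sample data shows why df2 contains duplicates: the same row is the extreme of several columns at once (this is just an illustrative sketch of the problem, not a solution):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1., 1., 1., 1., 2., 2., 2., 2., 2., 2., 2.],
                    'A': [100., -97., -56., 69., 150., -120., 30., 92., 35., -41., -75.],
                    'B': [99., -96., 63., -75., 140., -110., 91., -62., 76., 10., 2.],
                    'C': [98., -95., -45., 39., 130., -100., 90., -50., 70., -17., 33.]})

# row labels of the per-ID maxima and minima of every value column
idx_max = df1.groupby('ID')[['A', 'B', 'C']].idxmax().to_numpy().ravel()
idx_min = df1.groupby('ID')[['A', 'B', 'C']].idxmin().to_numpy().ravel()
index = [i for pair in zip(idx_max, idx_min) for i in pair]

# 12 index entries, but only 4 distinct rows: rows 0/1 and 4/5 happen to be
# the max/min of all three columns at once
print(len(index), len(set(index)))
```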

How can I get df3? The dataframe is large (>1M rows), so I need a solution that is not only correct but also efficient.

Upvotes: 2

Views: 5141

Answers (2)

Lambda

Reputation: 1392

Use an auxiliary function:

def filter_min_and_max(x):
    y = pd.DataFrame()
    for col in x.columns:
        if col != "ID":
            # max/min only would be: y[col] = [x[col].max(), x[col].min()]
            # per OP's comment, interleave the 3 largest and 3 smallest values:
            y[col] = [val for pair in zip(x[col].nlargest(3).tolist(),
                                          x[col].nsmallest(3).tolist())
                      for val in pair]
    return y

df1.groupby("ID").apply(lambda x: filter_min_and_max(x)).reset_index().drop(["level_1"], axis=1)
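A quick sanity check of this helper on the question's df1 (a self-contained sketch; the shape check just confirms 6 rows per ID: max/min, 2nd max/min, 3rd max/min):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1., 1., 1., 1., 2., 2., 2., 2., 2., 2., 2.],
                    'A': [100., -97., -56., 69., 150., -120., 30., 92., 35., -41., -75.],
                    'B': [99., -96., 63., -75., 140., -110., 91., -62., 76., 10., 2.],
                    'C': [98., -95., -45., 39., 130., -100., 90., -50., 70., -17., 33.]})

def filter_min_and_max(x):
    # interleave the 3 largest and 3 smallest values of every non-ID column
    y = pd.DataFrame()
    for col in x.columns:
        if col != "ID":
            y[col] = [val for pair in zip(x[col].nlargest(3).tolist(),
                                          x[col].nsmallest(3).tolist())
                      for val in pair]
    return y

result = (df1.groupby("ID")
             .apply(lambda x: filter_min_and_max(x))
             .reset_index()
             .drop(["level_1"], axis=1))
print(result.shape)  # 6 rows per ID, columns ID, A, B, C
```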

Upvotes: 2

David Dale

Reputation: 11444

There is a fast way to keep only the unique rows: df3 = df1.reindex(sorted(set(index))). This keeps only the 1st maximum/minimum values. You can then remove those rows from df1 with df1 = df1.drop(df3.index) and repeat the whole procedure as many times as you need (e.g. 3 times):

import pandas as pd
df1 = pd.DataFrame({'ID': pd.Series([1., 1., 1., 1., 2., 2., 2., 2., 2., 2., 2.]),
                    'A': pd.Series([100., -97., -56., 69., 150., -120., 30., 92., 35., -41., -75.]),
                    'B': pd.Series([99., -96., 63., -75., 140., -110., 91., -62., 76., 10., 2.]),
                    'C': pd.Series([98., -95., -45., 39., 130., -100., 90., -50., 70., -17., 33.])})

def keep_minmax(df1):
    # as_matrix() was removed from pandas; use to_numpy() instead
    idx_max = df1.groupby('ID')[['A', 'B', 'C']].idxmax().to_numpy()
    idx_min = df1.groupby('ID')[['A', 'B', 'C']].idxmin().to_numpy()
    index = []
    for i in range(len(idx_max)):
        for j in range(len(idx_max[0])):
            index.append(idx_max[i][j])
            index.append(idx_min[i][j])
    return df1.reindex(sorted(set(index)))

df = df1.copy()
results = []
for i in range(3):
    result = keep_minmax(df)
    result['order'] = i + 1
    results.append(result)
    df = df.drop(result.index)
df3 = pd.concat(results).sort_values(['ID', 'order'])
print(df3)

It will output:

        A      B      C   ID  order
0   100.0   99.0   98.0  1.0      1
1   -97.0  -96.0  -95.0  1.0      1
2   -56.0   63.0  -45.0  1.0      2
3    69.0  -75.0   39.0  1.0      2
4   150.0  140.0  130.0  2.0      1
5  -120.0 -110.0 -100.0  2.0      1
6    30.0   91.0   90.0  2.0      2
7    92.0  -62.0  -50.0  2.0      2
10  -75.0    2.0   33.0  2.0      2
8    35.0   76.0   70.0  2.0      3
9   -41.0   10.0  -17.0  2.0      3

You can see that for ID=1 there is no 3rd order, because all the rows of df1 are already exhausted; you would have to include duplicate rows (as in your example df3). Do you really want that?

I am asking because your post does not make clear what to do in ambiguous cases: when different rows hold the k-th best value in different columns, or when this k itself differs between columns. For example, what kind of df3 would you produce from the following df, and why? For simplicity, let's look only at the max values:

   A  B   ID
0  2  1  1.0
1  3  2  1.0
2  1  0  1.0
3  0  3  1.0

My algorithm (looking only for max values) would return:

   A  B   ID  order
1  3  2  1.0      1
3  0  3  1.0      1
0  2  1  1.0      2
2  1  0  1.0      3

Notice that the row (2, 1), which is 2nd by A and 3rd by B, is included in the 2nd order, because its best rank (2nd, by A) wins.

Do you have any alternative suggestion for how to handle such ambiguities?
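One way to make that tie-breaking explicit (a sketch of the idea, not part of the original answer): rank every value column in descending order and assign each row the best rank it achieves in any column. On the ambiguous example above, this reproduces the max-only ordering shown:

```python
import pandas as pd

# the ambiguous example from above
df = pd.DataFrame({'A': [2, 3, 1, 0], 'B': [1, 2, 0, 3], 'ID': [1., 1., 1., 1.]})

# rank each value column descending; a row's "order" is the best rank it
# achieves in any column (1 = it is the maximum of at least one column)
ranks = df[['A', 'B']].rank(ascending=False, method='min')
df['order'] = ranks.min(axis=1).astype(int)
print(df.sort_values('order'))
```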

Upvotes: 3
