Maikiii
Maikiii

Reputation: 451

Python : Dropping specific rows in a dataframe and keep a specif one

Let's say that I have this dataframe

Name = ['ID', 'Country', 'IBAN','ID_info_1', 'Dan_Age', 'ID_info_1','Dan_city','ID_info_1','Dan_country','ID_info_1', 'ID_info_2', 'ID_info_2','ID_info_2', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ]
Value = ['TAMARA_CO', 'GERMANY','FR56', '12', '18','25','Berlin','34', '55','345','432', '43', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['','','','EUR','EUR','EUR','','EUR','','','','EUR','EUR','USD','USD','','CHF', '','DKN']
Group = ['0','0','0','1','1','2','2','3','3','4','1','2','3','4','2','2','2','3','3']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})

print(df)

           Name      Value  Ccy Group
0            ID  TAMARA_CO          0
1       Country    GERMANY          0
2          IBAN       FR56          0
3     ID_info_1         12  EUR     1
4       Dan_Age         18  EUR     1
5     ID_info_1         25  EUR     2
6      Dan_city     Berlin          2
7     ID_info_1         34  EUR     3
8   Dan_country         55          3
9     ID_info_1        345          4
10    ID_info_2        432          1
11    ID_info_2         43  EUR     2
12    ID_info_2        GER  EUR     3
13      Dan_sex          M  USD     4
14      Dan_Age         22  USD     2
15  Dan_country        FRA          2
16      Dan_sex          M  CHF     2
17     Dan_city     Madrid          3
18  Dan_country        ESP  DKN     3

I want to reduce this dataframe ! I want to reduce only the rows that contains the string "info" by keeping the ones that have the highest level in the column "Group". So in this dataframe, it would mean that I keep the row "ID_info_1" in the group 4, and "ID_info_1" in the group 3. In addition, I want to change the their value in the column "Group" to 1.

So at the end I would like to get this new dataframe where the indexing is reset too

           Name      Value  Ccy Group
0            ID  TAMARA_CO          0
1       Country    GERMANY          0
2          IBAN       FR56          0
3     ID_info_1         12  EUR     1
4       Dan_Age         18  EUR     1
5      Dan_city     Berlin          2
6   Dan_country         55          3
7     ID_info_1        345          1
8     ID_info_2        GER  EUR     1
9       Dan_sex          M  USD     4
10      Dan_Age         22  USD     2
11  Dan_country        FRA          2
12      Dan_sex          M  CHF     2
13     Dan_city     Madrid          3
14  Dan_country        ESP  DKN     3

Anyone has an efficient idea ?

Thank you

Upvotes: 0

Views: 67

Answers (2)

Fjord
Fjord

Reputation: 31

You can create a mask using a lambda function that searches for the string 'info' in the Name column and the value in the Group Column.

arr = []
mask = df.apply(lambda x: True if 'info' in x['Name'] else False, axis=1)
for info in df[mask]['Name'].unique():
    min_val = df.loc[df['Name'] == info]['Group'].min()
    arr += list(df[(df['Name'] == info) & (df['Group'] > min_val)].index)

df.drop(arr, inplace=True)
df.reset_index(inplace=True)


       Name      Value  Ccy     Group
0            ID  TAMARA_CO          0
1       Country    GERMANY          0
2          IBAN       FR56          0
3     ID_info_1         12  EUR     1
4       Dan_Age         18  EUR     1
5      Dan_city     Berlin          2
6   Dan_country         55          3
7     ID_info_2        432          1
8       Dan_sex          M  USD     4
9       Dan_Age         22  USD     2
10  Dan_country        FRA          2
11      Dan_sex          M  CHF     2
12     Dan_city     Madrid          3
13  Dan_country        ESP  DKN     3

I Know that the df doesn't look 100p like the one you want, but this is how I understod your question. Let me know if i am wrong.

EDIT Reread the question and edited some code.

Upvotes: 1

JBernardo
JBernardo

Reputation: 33387

How about this:

# select rows with "info"
di = df[df.Name.str.contains('info')]

# Find the rows below max for removal
di = di[di.groupby('Name')['Group'].transform('max') != di['Group']]

# Remove those rows and set a new index as requested
df = df.drop(di.index).reset_index(drop=True)

# Change group to one on remaining "info" rows
df.loc[df.Name.str.contains('info'), 'Group'] = 1

Upvotes: 2

Related Questions