Reputation: 553
I have a table material
+--------+-----+-------------------+----------------+-----------+
| ID | REV | name | Description | curr |
+--------+-----+-------------------+----------------+-----------+
| 211-32 | 001 | Screw 1.0 | Used in MAT 1 | READY |
| 211-32 | 002 | Screw 2 plus | can be Used-32 | WITHDRAWN |
| 212-41 | 001 | Bolt H1 | Light solid | READY |
| 212-41 | 002 | BOLT H2+Form | Heavy solid | READY |
| 101-24 | 001 | HexHead 1-A | NOR-1 | READY |
| 101-24 | 002 | HexHead Spl | NOR-22 | READY |
| 423-98 | 001 | Nut Repair spare | NORM1 | READY |
| 423-98 | 002 | Nut Repair Part-C | NORM2 | WITHDRAWN |
| 423-98 | 003 | Nut SP-C | NORM2+NORM1 | NULL |
| 654-01 | 001 | Bar | Specific only | WITHDRAWN |
| 654-01 | 002 | Bar rod-S | Designed+Spe | WITHDRAWN |
| 654-01 | 003 | Bar OPG | Hard spec | NULL |
+--------+-----+-------------------+----------------+-----------+
Here each ID can have multiple revisions. I want to take latest revisions (i.e highest of 001,002,003 etc.,). But If the latest revision has curr
as either NULL
(string) or WITHDRAWN
then I have take the previous revision and its corresponding value. If even that's curr
is NULL
or WITHDRAWN
I have to again go to previous revision. If all the revision has the same issue then we can ignore it. so the expected output is
+--------+-----+------------------+---------------+-------+
| ID | REV | name | Description | curr |
+--------+-----+------------------+---------------+-------+
| 211-32 | 001 | Screw 1.0 | Used in MAT 1 | READY |
| 212-41 | 002 | BOLT H2+Form | Heavy solid | READY |
| 101-24 | 002 | HexHead Spl | NOR-22 | READY |
| 423-98 | 001 | Nut Repair spare | NORM1 | READY |
+--------+-----+------------------+---------------+-------+
I'm very new to Python. I have tried below code, but i'm not working. Any suggestion are highly appreciated.
import pandas as pd
import numpy as np
mydata = pd.read_csv('C:/Myfolder/Python/myfile.csv')
mydata.sort_values(['ID','REV'], ascending=[True, False]).drop_duplicates('',keep=last)
Upvotes: 1
Views: 73
Reputation: 123
I think first you should do is remove the NULL or WITHDRAW from the table.
mydata[mydata[curr] == 'Ready'] # this should do I think...
then you can try your sort and take the max rev value.
mydata = mydata.sort_values(['ID','REV']).drop_duplicates('ID',keep='last')
Upvotes: 1
Reputation: 23099
We can create a psuedo column to get the max and return its index.
first step is to filter out the values we want to ignore.
df1 = df.loc[
df[~df["curr"].isin(["WITHDRAWN", "NULL"])]
.assign(key=df["REV"].astype(int))
.groupby("ID")["key"]
.idxmax()
]
ID REV name Description curr
6 101-24 002 HexHead Spl NOR-22 READY
1 211-32 001 Screw 1.0 Used in MAT 1 READY
4 212-41 002 BOLT H2+Form Heavy solid READY
7 423-98 001 Nut Repair spare NORM1 READY
Upvotes: 2
Reputation: 29635
you can select the rows that don't have NULL or WITHDRAW in it with isin
, then do the sort_values
and drop_duplicates
:
mydata = mydata[~mydata['curr'].isin(['NULL','WITHDRAW'])]
mydata = mydata.sort_values(['ID','REV']).drop_duplicates('ID',keep='last')
Upvotes: 2