Reputation: 551
Given a DataFrame:
df = pd.DataFrame(
{'AgeAtMedStart': {1: -46.47, 2: 46.47, 3: 46.8, 4: 51.5, 5: 51.5},
'AgeAtMedStop': {1: 46.8, 2: 46.8, 3: nan, 4: -51.9, 5: 51.81},
'MedContinuing': {1: 'No', 2: 'No', 3: 'Yes', 4: 'No', 5: 'No'},
'Medication': {1: 'Med1', 2: 'Med2', 3: 'Med3', 4: 'Med4', 5: 'Med4'},
'YearOfMedStart': {1: 2016.0, 2: 2016.0, 3: 2016.0, 4: 2016.0, 5: 2016.0}}
)
df
AgeAtMedStart AgeAtMedStop MedContinuing Medication YearOfMedStart
1 -46.47 46.80 No Med1 2016.0
2 46.47 46.80 No Med2 2016.0
3 46.80 NaN Yes Med3 2016.0
4 51.50 -51.90 No Med4 2016.0
5 51.50 51.81 No Med4 2016.0
I want to filter to retain rows where any of the numeric values in the "AgeAt*" columns is negative.
My expected output for this output is to have row with index 1 since "AgeAtMedStart" has value -46.47 and row with index 4 since "AgeAtMedStop" is -51.9, so the output would be
AgeAtMedStart AgeAtMedStop MedContinuing Medication YearOfMedStart
1 -46.47 46.8 No Med1 2016.0
4 51.50 -51.9 No Med4 2016.0
So I've tried the different answers provided thus far, but all return an empty dataframe. And I believe part of the problem is that I have another column called AgeAtMedStartFlag (and AgeAtMedStopFlag) which contain strings. So for this sample csv:
RecordKey Medication CancerSiteForTreatment CancerSiteForTreatmentCode TreatmentLineCodeKey AgeAtMedStart AgeAtMedStartFlag YearOfMedStart MedContinuing AgeAtMedStop AgeAtMedStopFlag ChangeOfTreatment
1 Drug1 Site1 C1.0 First -46.47 Year And Month Are Known But Day Is Missing And Coded To 15 2016 No 46.8 Year And Month Are Known But Day Is Missing And Coded To 15 Yes
1 Drug2 Site2 C1.1 First 46.47 Year And Month Are Known But Day Is Missing And Coded To 15 2016 No 46.8 Year And Month Are Known But Day Is Missing And Coded To 15 Yes
1 Drug3 Site3 C1.2 First 46.8 Year And Month Are Known But Day Is Missing And Coded To 15 2016 Yes Yes
2 Drug4 Site4 C1.3 First 51.5 2016 No 51.9 Yes
2 Drug5 Site5 C1.4 First 51.5 2016 No -51.81 Yes
3 Drug6 Site6 C1.5 First 73.93 2016 No 74.42 Yes
3 Drug7 Site7 C1.6 First 73.93 2016 No 74.42 Yes
4 Drug8 Site8 C1.7 First 36.66 2015 No 37.24 Yes
4 Drug9 Site9 C1.8 First 36.66 2015 No 37.24 Yes
4 Drug10 Site10 C1.9 First 36.66 2015 No 37.24 Yes
9 Drug11 Site11 C1.10 First 43.55 2016 No 43.68 Yes
9 Drug12 Site12 C1.11 First 43.22 2016 No 43.49 Yes
9 Drug13 Site13 C1.12 First 43.55 2016 No 43.68 Yes
9 Drug14 Site14 C1.13 First 43.22 2016 No 43.49 Yes
10 Drug15 Site15 C1.14 First 74.42 2016 No 74.84 Yes
10 Drug16 Site16 C1.15 First 73.56 2015 No 73.98 Yes
10 Drug17 Site17 C1.16 First 73.56 2015 No 73.98 No
10 Drug18 Site18 C1.17 First 74.42 2016 No 74.84 No
10 Drug19 Site19 C1.18 First 73.56 2015 No 73.98 No
10 Drug20 Site20 C1.19 First 74.42 2016 No 74.84 No
11 Drug21 Site21 C1.20 First 70.72 2013 No 72.76 No
11 Drug22 Site22 C1.21 First 68.76 2011 No 70.62 No
11 Drug23 Site23 C1.22 First 73.43 2016 No 73.96 No
11 Drug24 Site24 C1.23 First 72.76 2015 No 73.43 No
with this change to my script:
age_df = df.columns[(df.columns.str.startswith('AgeAt')) & (~df.columns.str.endswith('Flag'))]
df[df[age_df] < 0].to_excel('invalid.xlsx', 'Benjamin_Button')
It returns:
RecordKey Medication CancerSiteForTreatment CancerSiteForTreatmentCode TreatmentLineCodeKey AgeAtMedStart AgeAtMedStartFlag YearOfMedStart MedContinuing AgeAtMedStop AgeAtMedStopFlag ChangeOfTreatment
1 -46.47
1
1
2
2 -51.81
3
3
4
4
4
9
9
9
9
10
10
10
10
10
10
11
11
11
11
Can I modify this implementation to only return the rows where the negatives are and if possible, the rest of the values for those rows? Or even better, just the negative ages and the RecordKey for that row.
Upvotes: 3
Views: 2706
Reputation: 3282
Here's a simple one-liner for you. If you need to logically determine if the column is numeric refer to coldspeed's answer. But, if you are ok with explicit column references a simple method like this will work.
Note I'm also filling NaN's with 0; this will meet your requirement even though the data is missing. Nan's can be handled in other ways, but this will suffice here. If you have missing values in other columns you'd like to preserve, this can also be done (I didn't include it here for simplicity).
myData = df.fillna(0).query('AgeAtMedStart < 0 or AgeAtMedStop < 0')
Returns:
AgeAtMedStart AgeAtMedStop MedContinuing Medication YearOfMedStart
1 -46.47 46.8 No Med1 2016.0
4 51.50 -51.9 No Med4 2016.0
Pandas native query method is very handy for simple filter expressions.
Refer to the docs for more info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
Upvotes: 1
Reputation: 4142
First get the columns of interest:
cols = [col for col in df if col.lower().startswith('AgeAt')]
Then get the DF with those columns:
df_wanted = df[cols]
Then get the rows:
x = df_wanted[df_wanted < 0]
Of course, if you are looking at multiple columns, some of the cells will contain nan
.
Upvotes: 0