n8-da-gr8
n8-da-gr8

Reputation: 551

Select rows where values in at least one column is negative

Given a DataFrame:

df = pd.DataFrame(
    {'AgeAtMedStart': {1: -46.47, 2: 46.47, 3: 46.8, 4: 51.5, 5: 51.5},
     'AgeAtMedStop': {1: 46.8, 2: 46.8, 3: nan, 4: -51.9, 5: 51.81},
     'MedContinuing': {1: 'No', 2: 'No', 3: 'Yes', 4: 'No', 5: 'No'},
     'Medication': {1: 'Med1', 2: 'Med2', 3: 'Med3', 4: 'Med4', 5: 'Med4'},
     'YearOfMedStart': {1: 2016.0, 2: 2016.0, 3: 2016.0, 4: 2016.0, 5: 2016.0}}
)

df 
   AgeAtMedStart  AgeAtMedStop MedContinuing Medication  YearOfMedStart
1         -46.47         46.80            No       Med1          2016.0
2          46.47         46.80            No       Med2          2016.0
3          46.80           NaN           Yes       Med3          2016.0
4          51.50        -51.90            No       Med4          2016.0
5          51.50         51.81            No       Med4          2016.0

I want to filter to retain rows where any of the numeric values in the "AgeAt*" columns is negative.

My expected output for this output is to have row with index 1 since "AgeAtMedStart" has value -46.47 and row with index 4 since "AgeAtMedStop" is -51.9, so the output would be

   AgeAtMedStart  AgeAtMedStop MedContinuing Medication  YearOfMedStart
1         -46.47          46.8            No       Med1          2016.0
4          51.50         -51.9            No       Med4          2016.0

EDIT1:

So I've tried the different answers provided thus far, but all return an empty dataframe. And I believe part of the problem is that I have another column called AgeAtMedStartFlag (and AgeAtMedStopFlag) which contain strings. So for this sample csv:

RecordKey   Medication  CancerSiteForTreatment  CancerSiteForTreatmentCode  TreatmentLineCodeKey    AgeAtMedStart   AgeAtMedStartFlag   YearOfMedStart  MedContinuing   AgeAtMedStop    AgeAtMedStopFlag    ChangeOfTreatment
1   Drug1   Site1   C1.0    First   -46.47  Year And Month Are Known But Day Is Missing And Coded To 15 2016    No  46.8    Year And Month Are Known But Day Is Missing And Coded To 15 Yes
1   Drug2   Site2   C1.1    First   46.47   Year And Month Are Known But Day Is Missing And Coded To 15 2016    No  46.8    Year And Month Are Known But Day Is Missing And Coded To 15 Yes
1   Drug3   Site3   C1.2    First   46.8    Year And Month Are Known But Day Is Missing And Coded To 15 2016    Yes         Yes
2   Drug4   Site4   C1.3    First   51.5        2016    No  51.9        Yes
2   Drug5   Site5   C1.4    First   51.5        2016    No  -51.81      Yes
3   Drug6   Site6   C1.5    First   73.93       2016    No  74.42       Yes
3   Drug7   Site7   C1.6    First   73.93       2016    No  74.42       Yes
4   Drug8   Site8   C1.7    First   36.66       2015    No  37.24       Yes
4   Drug9   Site9   C1.8    First   36.66       2015    No  37.24       Yes
4   Drug10  Site10  C1.9    First   36.66       2015    No  37.24       Yes
9   Drug11  Site11  C1.10   First   43.55       2016    No  43.68       Yes
9   Drug12  Site12  C1.11   First   43.22       2016    No  43.49       Yes
9   Drug13  Site13  C1.12   First   43.55       2016    No  43.68       Yes
9   Drug14  Site14  C1.13   First   43.22       2016    No  43.49       Yes
10  Drug15  Site15  C1.14   First   74.42       2016    No  74.84       Yes
10  Drug16  Site16  C1.15   First   73.56       2015    No  73.98       Yes
10  Drug17  Site17  C1.16   First   73.56       2015    No  73.98       No
10  Drug18  Site18  C1.17   First   74.42       2016    No  74.84       No
10  Drug19  Site19  C1.18   First   73.56       2015    No  73.98       No
10  Drug20  Site20  C1.19   First   74.42       2016    No  74.84       No
11  Drug21  Site21  C1.20   First   70.72       2013    No  72.76       No
11  Drug22  Site22  C1.21   First   68.76       2011    No  70.62       No
11  Drug23  Site23  C1.22   First   73.43       2016    No  73.96       No
11  Drug24  Site24  C1.23   First   72.76       2015    No  73.43       No

with this change to my script:

age_df = df.columns[(df.columns.str.startswith('AgeAt')) & (~df.columns.str.endswith('Flag'))]

df[df[age_df] < 0].to_excel('invalid.xlsx', 'Benjamin_Button')

It returns:

RecordKey   Medication  CancerSiteForTreatment  CancerSiteForTreatmentCode  TreatmentLineCodeKey    AgeAtMedStart   AgeAtMedStartFlag   YearOfMedStart  MedContinuing   AgeAtMedStop    AgeAtMedStopFlag    ChangeOfTreatment
1                   -46.47                     
1                                          
1                                          
2                                          
2                                   -51.81     
3                                          
3                                          
4                                          
4                                          
4                                          
9                                          
9                                          
9                                          
9                                          
10                                         
10                                         
10                                         
10                                         
10                                         
10                                         
11                                         
11                                         
11                                         
11

Can I modify this implementation to only return the rows where the negatives are and if possible, the rest of the values for those rows? Or even better, just the negative ages and the RecordKey for that row.

Upvotes: 3

Views: 2706

Answers (2)

Sevyns
Sevyns

Reputation: 3282

Here's a simple one-liner for you. If you need to logically determine if the column is numeric refer to coldspeed's answer. But, if you are ok with explicit column references a simple method like this will work.

Note I'm also filling NaN's with 0; this will meet your requirement even though the data is missing. Nan's can be handled in other ways, but this will suffice here. If you have missing values in other columns you'd like to preserve, this can also be done (I didn't include it here for simplicity).

myData = df.fillna(0).query('AgeAtMedStart < 0 or AgeAtMedStop < 0')

Returns:

   AgeAtMedStart  AgeAtMedStop MedContinuing Medication  YearOfMedStart
1         -46.47          46.8            No       Med1          2016.0
4          51.50         -51.9            No       Med4          2016.0

Pandas native query method is very handy for simple filter expressions.

Refer to the docs for more info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html

Upvotes: 1

pookie
pookie

Reputation: 4142

First get the columns of interest:

cols = [col for col in df if col.lower().startswith('AgeAt')]

Then get the DF with those columns:

df_wanted = df[cols]

Then get the rows:

x = df_wanted[df_wanted < 0]

Of course, if you are looking at multiple columns, some of the cells will contain nan.

Upvotes: 0

Related Questions