S44
S44

Reputation: 473

Pandas: Delete Rows in a dataframe if specific columns don't contain specific text

I have a df

     id  column_int column_int  column_A column_B column_C column_D
 0   1        int       int         ABC     ABC     Keep      na
 1   2        int       int         ABC     ABC     ABC       ABC
 2   3        int       int         ABC     Save    na        na
 3   4        int       int         ABC     Keep    na        na
 4   5        int       imt         ABC     ABC     ABC       ABC
 .
 . 

Where column_int are columns that contain ints and column A-D contain text values. I want to keep only the rows that have either Keep or Save as row values

Before:

 id  column_int column_int  column_A column_B column_C column_D
 0   1        int       int         ABC     ABC     Keep      na
 1   2        int       int         ABC     ABC     ABC       ABC
 2   3        int       int         ABC     Save    na        na
 3   4        int       int         ABC     Keep    na        na
 4   5        int       imt         ABC     ABC     ABC       ABC

After:

 id  column_int column_int  column_A column_B column_C column_D
 0   1        int       int         ABC     ABC     Keep      na
 2   3        int       int         ABC     Save    na        na
 3   4        int       int         ABC     Keep    na        na

I tried the following

for column in df:
    if type(column) == object:
        df = df[df[column].str.contains('Save')] | df[df[column].str.contains('Keep')]
    else:
        pass

Upvotes: 2

Views: 1287

Answers (2)

SeaBean
SeaBean

Reputation: 23217

You can use use .apply() on the selected columns, then for each column check for Save or Keep by str.contains. Then, use .any() on axis=1 (for row-wise operation) to check if the row contains such strings.

Finally, filter by .loc, as follows:

cols = ['column_A',  'column_B',  'column_C',  'column_D']

df.loc[df[cols].apply(lambda x: x.str.contains(r'Save|Keep')).any(axis=1)]

Result:

   id column_int column_int.1 column_A column_B column_C column_D
0   1        int          int      ABC      ABC     Keep       na
2   3        int          int      ABC     Save       na       na
3   4        int          int      ABC     Keep       na       na

Upvotes: 2

Esa Tuulari
Esa Tuulari

Reputation: 169

Maybe easier and clearer to do without the for-loop.

dfA = df.loc[(df.column_A == 'Save') or (df.column_A == 'Keep')]
dfB = df.loc[(df.column_B == 'Save') or (df.column_B == 'Keep')]
dfC = df.loc[(df.column_C == 'Save') or (df.column_C == 'Keep')]
dfD = df.loc[(df.column_D == 'Save') or (df.column_D == 'Keep')]

Then concatenate the dataframes together

df = pd.concat([dfA, dfB, dfC, dfD])

Upvotes: 1

Related Questions