Delete row based on value in any column of the dataframe

Question

There are several posts on how to drop rows if one column in a dataframe holds a certain undesired string, but I am struggling with how to do that if I have to check all columns in a dataset for that string, AND if I do not know beforehand exactly which column contains the string.

Suppose:

data = pd.DataFrame({'col1' : ['December 31,', 'December 31, 2019', 'countryB', 'countryC'],
                     'col2' : ['December 31,', 21, 19, 18],
                     'col3' : [np.NaN, 22, 23, 14]})

Which gives:

                col1          col2  col3
0       December 31,  December 31,   NaN
1  December 31, 2019            21  22.0
2           countryB            19  23.0
3           countryC            18  14.0

I want to delete all rows that contain December 31,, but not if December 31, is followed by a year in YYYY format. Is use a regex for that: r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})', which properly identifies December 31, only.

The problem is that I have many of such tables, and I do not know beforehand in which column the December 31, (or its equivalent for other months) appears.

What I currently do is:

delete = pd.DataFrame(columns = data.columns)
for name, content in data.iteritems():
    take = data[data[name].astype(str).str.contains(r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})',
                                regex = True,
                                flags = re.IGNORECASE & re.DOTALL, na = False)]
    delete = delete.append(take)
delete = delete.drop_duplicates()
index = mean(delete.index)
clean = data.drop([index])

Which returns, as desired:

                col1 col2  col3
1  December 31, 2019   21  22.0
2           countryB   19  23.0
3           countryC   18  14.0

That is, I loop over all columns in data, store in delete the rows that I want to delete from data, delete duplicates (because December 31, appears in col1 and col2), get the index of the unique undesired row (0 here) and then delete that row in data based on the index. It does work, but that seems like a cumbersome way of achieving this.

I am wondering: Is there a better way of deleting all rows in which December 31, appears in any column?

mcsoini · Accepted Answer

Use pd.DataFrame.any(...)

mask = data.astype(str).apply(lambda x: x.str.contains(r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})',
               regex = True, flags = re.IGNORECASE & re.DOTALL, na = False)).any(axis=1)

data.loc[~mask]

Delete row based on value in any column of the dataframe

Answers (2)

Related Questions