MMM
MMM

Reputation: 67

Delete row based on value in any column of the dataframe

There are several posts on how to drop rows if one column in a dataframe holds a certain undesired string, but I am struggling with how to do that if I have to check all columns in a dataset for that string, AND if I do not know beforehand exactly which column contains the string.

Suppose:

data = pd.DataFrame({'col1' : ['December 31,', 'December 31, 2019', 'countryB', 'countryC'],
                     'col2' : ['December 31,', 21, 19, 18],
                     'col3' : [np.NaN, 22, 23, 14]})

Which gives:

                col1          col2  col3
0       December 31,  December 31,   NaN
1  December 31, 2019            21  22.0
2           countryB            19  23.0
3           countryC            18  14.0

I want to delete all rows that contain December 31,, but not if December 31, is followed by a year in YYYY format. Is use a regex for that: r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})', which properly identifies December 31, only.

The problem is that I have many of such tables, and I do not know beforehand in which column the December 31, (or its equivalent for other months) appears.

What I currently do is:

delete = pd.DataFrame(columns = data.columns)
for name, content in data.iteritems():
    take = data[data[name].astype(str).str.contains(r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})',
                                regex = True,
                                flags = re.IGNORECASE & re.DOTALL, na = False)]
    delete = delete.append(take)
delete = delete.drop_duplicates()
index = mean(delete.index)
clean = data.drop([index])

Which returns, as desired:

                col1 col2  col3
1  December 31, 2019   21  22.0
2           countryB   19  23.0
3           countryC   18  14.0

That is, I loop over all columns in data, store in delete the rows that I want to delete from data, delete duplicates (because December 31, appears in col1 and col2), get the index of the unique undesired row (0 here) and then delete that row in data based on the index. It does work, but that seems like a cumbersome way of achieving this.

I am wondering: Is there a better way of deleting all rows in which December 31, appears in any column?

Upvotes: 2

Views: 207

Answers (2)

Poojan
Poojan

Reputation: 3519

data[~data.apply(lambda x: any([True if re.match('December 31,$',str(y)) else False for y in x]), axis=1)]
  • You can use .apply method to filter rows like this.
  • Doesn't using r"December 31,$"' regex works for your case? $ represent ending of string. If not just replace regex with your working regex.

Upvotes: 2

mcsoini
mcsoini

Reputation: 6642

Use pd.DataFrame.any(...)

mask = data.astype(str).apply(lambda x: x.str.contains(r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})',
               regex = True, flags = re.IGNORECASE & re.DOTALL, na = False)).any(axis=1)

data.loc[~mask]

Upvotes: 1

Related Questions