Reputation: 67
There are several posts on how to drop rows if one column in a dataframe holds a certain undesired string, but I am struggling with how to do that if I have to check all columns in a dataset for that string, AND if I do not know beforehand exactly which column contains the string.
Suppose:
data = pd.DataFrame({'col1' : ['December 31,', 'December 31, 2019', 'countryB', 'countryC'],
'col2' : ['December 31,', 21, 19, 18],
'col3' : [np.NaN, 22, 23, 14]})
Which gives:
col1 col2 col3
0 December 31, December 31, NaN
1 December 31, 2019 21 22.0
2 countryB 19 23.0
3 countryC 18 14.0
I want to delete all rows that contain December 31,, but not if December 31, is followed by a year in YYYY format. Is use a regex for that: r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})'
, which properly identifies December 31, only.
The problem is that I have many of such tables, and I do not know beforehand in which column the December 31, (or its equivalent for other months) appears.
What I currently do is:
delete = pd.DataFrame(columns = data.columns)
for name, content in data.iteritems():
take = data[data[name].astype(str).str.contains(r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})',
regex = True,
flags = re.IGNORECASE & re.DOTALL, na = False)]
delete = delete.append(take)
delete = delete.drop_duplicates()
index = mean(delete.index)
clean = data.drop([index])
Which returns, as desired:
col1 col2 col3
1 December 31, 2019 21 22.0
2 countryB 19 23.0
3 countryC 18 14.0
That is, I loop over all columns in data, store in delete the rows that I want to delete from data, delete duplicates (because December 31, appears in col1 and col2), get the index of the unique undesired row (0 here) and then delete that row in data based on the index. It does work, but that seems like a cumbersome way of achieving this.
I am wondering: Is there a better way of deleting all rows in which December 31, appears in any column?
Upvotes: 2
Views: 207
Reputation: 3519
data[~data.apply(lambda x: any([True if re.match('December 31,$',str(y)) else False for y in x]), axis=1)]
.apply
method to filter rows like this. r"December 31,$"'
regex works for your case? $
represent ending of string. If not just replace regex with your working regex.Upvotes: 2
Reputation: 6642
Use pd.DataFrame.any(...)
mask = data.astype(str).apply(lambda x: x.str.contains(r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})',
regex = True, flags = re.IGNORECASE & re.DOTALL, na = False)).any(axis=1)
data.loc[~mask]
Upvotes: 1