Reputation: 1367
I'm trying to filter a dataframe based on several conditions. Then, I want to drop that subset from a separate, much larger dataframe.
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['UNKNOWN', 'UNK', 'TEST', 'TEST'],
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})
df2 = pd.DataFrame({'A': ['UNKNOWN', 'UNK', 'TEST', 'TEST', 'UNKOWN', 'UNKKK'],
                    'E': pd.Categorical(["test", "train", "test", "train", 'train', 'train']),
                    'D': np.array([3] * 6, dtype='int32'),
                    'F': 'foo'})
rgx = r'UNKNOWN|UNK'
df_drop = df.loc[df['A'].str.contains(rgx, na=False, flags=re.IGNORECASE, regex=True, case=False)]
df2 = df2[~df_drop]
I want the following output for df2:
      A  D      E    F
2  TEST  3   test  foo
3  TEST  3  train  foo
Instead I get the following error:
TypeError: bad operand type for unary ~: 'str'
The reason I am not filtering df2 directly is that I want to make df_drop its own separate dataframe in order to retain the records that I have dropped.
I think I'm either misunderstanding how the unary ~ operator is supposed to work or I've made a syntax error that I can't find. None of the previous solutions (for instance, removing NaNs from the dataframe) seem to apply here.
Upvotes: 2
Views: 4308
Reputation: 862406
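I think you need to build the boolean mask on the big DataFrame (df2) and filter with that. ~ inverts a boolean mask, but df_drop is a DataFrame of strings, so ~df_drop raises the TypeError: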
rgx = r'UNKNOWN|UNK'
mask = df2['A'].str.contains(rgx, na=False, flags=re.IGNORECASE, regex=True, case=False)
print(mask)
0     True
1     True
2    False
3    False
4     True
5     True
Name: A, dtype: bool
print(df2[~mask])
      A  D      E    F
2  TEST  3   test  foo
3  TEST  3  train  foo
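If you also want to keep the dropped records as their own DataFrame, as the question asks, the same mask can probably be used for both halves. This is only a minimal sketch under the question's sample data: the name df2_kept is made up for illustration, and it uses case=False alone, assuming that is enough for a case-insensitive match here.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df2 = pd.DataFrame({
    'A': ['UNKNOWN', 'UNK', 'TEST', 'TEST', 'UNKOWN', 'UNKKK'],
    'E': pd.Categorical(['test', 'train', 'test', 'train', 'train', 'train']),
    'D': np.array([3] * 6, dtype='int32'),
    'F': 'foo',
})

rgx = r'UNKNOWN|UNK'

# Boolean Series aligned with df2's index; na=False treats missing values as non-matches.
mask = df2['A'].str.contains(rgx, case=False, na=False)

# Rows that match the pattern, kept as a separate record of what was dropped.
df_drop = df2[mask]

# Rows that do not match (df2_kept is a hypothetical name, not from the question).
df2_kept = df2[~mask]

print(df_drop)
print(df2_kept)
```

Because both frames come from the same mask, every row of df2 ends up in exactly one of them.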
Upvotes: 5