Reputation: 43
Been searching for a while in order to understand how to do this basic task without any success which is very strange.
I have a dataset where some of the rows contain '-', I have no clue under which columns these values lie.
How do I search in the whole dataset (including all columns) for '-' and drop the rows containing this value?
thank you!
Upvotes: 2
Views: 376
Reputation: 3591
This is a bit more robust than wwnde's answer, as it will work if some of the columns aren't originally strings:
df.loc[~df.apply(lambda x: any('-' in str(col) for col in x), axis = 1)]
If you have data that's stored as datetime, it will display as having -
, but will return an error if you check for inclusion without converting to str
first. Negative numbers will also return True once converted to str
. If you want different behavior, you'll have to do something more complicated, such as
df.loc[~df.apply(lambda x: any('-' in col if isinstance(col, str) else False for col in x), axis = 1)]
Upvotes: 0
Reputation: 26676
Please Try
df[~df.apply(lambda x: x.str.contains('-')).sum(1).gt(0)]
How it works
df.apply(lambda x: x.str.contains('-'))#Gives a boolean return of where there is '-'
df.apply(lambda x: x.str.contains('-')).sum(1)# Sums the boolean in each row.
df.apply(lambda x: x.str.contains('-')).sum(1).gt(0)#tests and returns true where the value was greater than zero, meaning anywhere where there was '-' returned at least 1
~df.apply(lambda x: x.str.contains('-')).sum(1).gt(0)- Reverse the above, giving you rows where there was no '-'
df[~df.apply(lambda x: x.str.contains('-')).sum(1).gt(0)]# Mask the rows where there was no'-' into a dataframe
Upvotes: 0
Reputation: 146
df = pd.DataFrame([['-', 2, '-', 0],
[3, 4, '-', 1],
['-', '-', '-', 5],
['-', 3, '-', 4]],
columns=list('ABCD'))
then df looks like
df =
A B C D
0 - 2 - 0
1 3 4 - 1
2 - - - 5
3 - 3 - 4
you can replace all '-' with nan values with df.replace()
function. you can do help(df.replace)
do learn more
df = df.replace('-', np.nan)
df=
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
to drop columns with these '-' you can drop them with df.dropna(how='all')
Upvotes: 2