Reputation: 267
I have a pandas DataFrame in which one column (value_1) contains only floats, while another column (value_2) contains either a list of floats, None, or a single float value. I have ensured all values are floats.
Ultimately, I want to use Series.isin() to check how many records of value_1 are in value_2, but it is not working for me. When I ran the code below:
df[~df['value_1'].isin(df['value_2'])]
This is what it returned, which is not what I expected, since some values in value_1 are clearly in the value_2 lists:
0   88870.0  [88870.0]
1  150700.0  None
2  225000.0  [225000.0, 225000.0]
3  305000.0  [305606.0, 305000.0, 1067.5]
4  392000.0  [392000.0]
5  198400.0  396
What am I missing? Please help.
Upvotes: 1
Views: 178
Reputation: 863751
Use zip with a list comprehension to test whether each list does not contain the float from value_1. Rows whose value_2 is not a list get False, so they are removed. Then filter with the resulting mask by boolean indexing:
import pandas as pd
df = pd.DataFrame({'value_1': [88870.0, 150700.0, 392000.0],
                   'value_2': [[88870.0], None, [88870.0, 45.4]]})
print (df)
value_1 value_2
0 88870.0 [88870.0]
1 150700.0 None
2 392000.0 [88870.0, 45.4]
mask = [a not in b if isinstance(b, list) else False
for a, b in zip(df['value_1'], df['value_2'])]
df1 = df[mask]
print (df1)
value_1 value_2
2 392000.0 [88870.0, 45.4]
If you also need to test scalar values (non-list rows are then compared with !=, so a None row is kept):
mask = [a not in b if isinstance(b, list) else a != b
for a, b in zip(df['value_1'], df['value_2'])]
df2 = df[mask]
print (df2)
value_1 value_2
1 150700.0 None
2 392000.0 [88870.0, 45.4]
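Since the question asks how many records of value_1 are in value_2, the same row-wise test can also be turned into a count. The following is only a minimal sketch on the sample data from the question; the helper name value_in_cell and the variables found, n_found and missing are hypothetical, and it assumes that a None cell should never count as a match:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "value_1": [88870.0, 150700.0, 225000.0, 305000.0, 392000.0, 198400.0],
    "value_2": [[88870.0], None, [225000.0, 225000.0],
                [305606.0, 305000.0, 1067.5], [392000.0], 396],
})

def value_in_cell(value, cell):
    """Hypothetical helper: True if `value` is found in `cell`."""
    if isinstance(cell, (list, tuple)):
        return value in cell          # membership test for list-like cells
    if cell is None:
        return False                  # assumption: None never matches
    return bool(value == cell)        # plain scalar comparison

# Boolean mask aligned with df, built row by row with zip.
found = pd.Series(
    [value_in_cell(a, b) for a, b in zip(df["value_1"], df["value_2"])],
    index=df.index,
)

n_found = int(found.sum())   # number of rows where value_1 is in value_2
missing = df[~found]         # rows where it is not

print(n_found)
print(missing)
```

On this sample the mask is True for rows 0, 2, 3 and 4, so n_found is 4 and missing holds rows 1 and 5.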
Performance: the pure Python comparison should be faster than calling np.isin for every row, but it is best to test on real data:
# 30k rows (3 sample rows * N)
N = 10000
df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0] * N,
'value_2':[[88870.0],None, [88870.0,45.4]] * N})
print (df)
In [51]: %timeit df[[a not in b if isinstance(b, list) else a != b for a, b in zip(df['value_1'], df['value_2'])]]
18.8 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df[[not bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]
419 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 1
Reputation: 262559
You can use boolean indexing with numpy.isin in a list comprehension:
import numpy as np
out = df[[bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]
Output:
value_1 value_2
0 88870.0 [88870.0]
2 225000.0 [225000.0, 225000.0]
3 305000.0 [305606.0, 305000.0, 1067.5]
4 392000.0 [392000.0]
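If you would rather avoid the Python-level loop, one possible alternative (not part of the answer above, and not benchmarked here) is to flatten the lists with DataFrame.explode and compare element-wise. This is a sketch on the question's sample data; the names exploded, candidates, found and out are illustrative, and it assumes every non-missing entry of value_2 is numeric:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "value_1": [88870.0, 150700.0, 225000.0, 305000.0, 392000.0, 198400.0],
    "value_2": [[88870.0], None, [225000.0, 225000.0],
                [305606.0, 305000.0, 1067.5], [392000.0], 396],
})

# One row per list element; scalars and None stay as single rows,
# and the original index label is repeated for each element.
exploded = df.explode("value_2")

# Convert to float so that None becomes NaN, which never compares equal.
candidates = pd.to_numeric(exploded["value_2"], errors="coerce")

# True for an original row if any of its elements equals value_1.
found = exploded["value_1"].eq(candidates).groupby(level=0).any()

out = df[found]
print(out)
```

On this sample it selects the same rows (0, 2, 3 and 4) as the np.isin version. Note that a scalar in value_2 is compared directly with value_1, so it would count as a match when the two are equal.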
Upvotes: 2