Reputation: 4705
I'm wondering how to check that every row in a dataframe contains a particular set of values.
For example:
import pandas as pd
VALUES = [1, 2]
df_no = pd.DataFrame(
{
"a": [1],
"b": [1],
}
)
df_yes = pd.DataFrame(
{
"a": [1],
"b": [2],
"c": [3],
}
)
Here df_no doesn't contain all the values of VALUES in each of its rows, whereas df_yes does.
One approach is the following:
# check df_no
all(
[
all(value in row for value in VALUES)
for row in df_no.apply(lambda x: x.unique(), axis=1)
]
)
# returns False
# check df_yes
all(
[
all(value in row for value in VALUES)
for row in df_yes.apply(lambda x: x.unique(), axis=1)
]
)
# returns True
I feel as though this approach might not be so clear, and that there might be a more idiomatic way of going about things.
Upvotes: 2
Views: 115
Reputation: 862641
Use issubset in a generator expression:
s = set(VALUES)
print (all(s.issubset(x) for x in df_no.to_numpy()))
False
s = set(VALUES)
print (all(s.issubset(x) for x in df_yes.to_numpy()))
True
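As a minimal sketch, the same check can be wrapped in a small helper so it can be reused on any dataframe. The function name all_rows_contain is hypothetical and not part of pandas; the dataframes are the ones from the question.

```python
import pandas as pd


def all_rows_contain(df, values):
    """Return True if every row of df contains all of the given values.

    Hypothetical helper name; it only wraps the issubset check shown above.
    """
    required = set(values)
    # to_numpy() yields one array per row; issubset accepts any iterable
    return all(required.issubset(row) for row in df.to_numpy())


VALUES = [1, 2]
df_no = pd.DataFrame({"a": [1], "b": [1]})
df_yes = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

print(all_rows_contain(df_no, VALUES))   # False
print(all_rows_contain(df_yes, VALUES))  # True
```

Note that all() stops at the first row that fails, so a dataframe with an early failing row returns quickly.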
Which is faster? It depends on the data:
VALUES = [1, 2]
df = pd.DataFrame(
{
"a": [1,2,8],
"b": [2,8,2],
"c": [3,1,1],
}
)
#30k rows
df = pd.concat([df] * 10000, ignore_index=True)
print (df)
In [171]: %%timeit
...: s = set(VALUES)
...: all(s.issubset(x) for x in df.to_numpy())
...:
...:
55.9 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [172]: %%timeit
...: vals = set(VALUES)
...: df.apply(vals.issubset, axis=1).all()
...:
...:
211 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#3k rows (rebuilt from the original 3-row df, not the 30k-row one above)
df = pd.concat([df] * 1000, ignore_index=True)
print (df)
In [174]: %%timeit
...: s = set(VALUES)
...: all(s.issubset(x) for x in df.to_numpy())
...:
...:
5.46 ms ± 76.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [175]: %%timeit
...: vals = set(VALUES)
...: df.apply(vals.issubset, axis=1).all()
...:
...:
21.5 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
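Another option, not timed here and not part of the comparison above, is to avoid Python-level sets and compare the whole dataframe against each required value. The sketch below assumes the values are directly comparable with == (for example plain integers); the function name rows_contain_all_vectorized is hypothetical.

```python
import pandas as pd


def rows_contain_all_vectorized(df, values):
    """Hypothetical helper: one elementwise comparison per required value."""
    # df.eq(v) marks the cells equal to v, any(axis=1) asks "does this row
    # contain v?", and the trailing all() requires that of every row.
    return all(df.eq(v).any(axis=1).all() for v in values)


VALUES = [1, 2]
df = pd.DataFrame({"a": [1, 2, 8], "b": [2, 8, 2], "c": [3, 1, 1]})
df_no = pd.DataFrame({"a": [1], "b": [1]})

print(rows_contain_all_vectorized(df, VALUES))     # True
print(rows_contain_all_vectorized(df_no, VALUES))  # False
```

This does one pass over the dataframe per required value, so whether it beats the issubset versions likely depends on the number of values and the shape of the data; it would need to be measured.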
Upvotes: 3
Reputation: 260630
You can use Python sets and issubset:
vals = set(VALUES)
df_yes.apply(lambda x: vals.issubset(set(x)), axis=1).all()
A shorter version:
vals = set(VALUES)
df_yes.apply(vals.issubset, axis=1).all()
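Dropping the final .all() leaves a boolean Series with one entry per row, which can be handy for finding the rows that fail the check. A minimal sketch, using a made-up dataframe df_mixed (not from the question) whose middle row lacks the value 2:

```python
import pandas as pd

VALUES = [1, 2]
vals = set(VALUES)

# Hypothetical data: the second row contains 1 but not 2
df_mixed = pd.DataFrame({"a": [1, 1, 2], "b": [2, 1, 5], "c": [3, 4, 1]})

# One boolean per row: does the row contain every value in VALUES?
mask = df_mixed.apply(vals.issubset, axis=1)
# Rows that are missing at least one required value
failing = df_mixed[~mask]

print(mask.tolist())           # [True, False, True]
print(failing.index.tolist())  # [1]
```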
Upvotes: 1