daiyue

Reputation: 7448

How to check if a column in a data frame contains only a specific set of values

I need to check whether a certain column (df['some_col']) in a data frame contains only a specific set of values (e.g. 'a', the empty string, and NaN, i.e. ["a", "", NaN]). I can think of using unique to list all the unique values and checking whether any of them falls outside the predefined set, but I am not sure whether NaN is counted as a value.

Upvotes: 1

Views: 985

Answers (1)

MaxU - stand with Ukraine

Reputation: 210812

Yes, you can use unique() for that. It reports NaN as one of the values:

In [35]: w
Out[35]:
     word
0  word03
1     NaN
2  word04
3
4  word02
5  word01
6     NaN
7  word01
8  word01
9  word01

In [36]: w.word.unique()
Out[36]: array(['word03', nan, 'word04', '', 'word02', 'word01'], dtype=object)

Using sets, we can then see which strings in your DF are not among the allowed/expected ones:

In [45]: allowed_words = set(['','word01', np.nan])

In [46]: set(w.word.unique()) - allowed_words
Out[46]: {'word02', 'word03', 'word04'}
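
One caveat with the set approach: NaN is not equal to itself, so set membership only matches it when the column holds the very same np.nan object. If that is a concern, a mask-based check may be more robust. Below is a minimal sketch, assuming the example frame above is rebuilt by hand; the names allowed_strings, ok, only_allowed and unexpected are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the session above (assumed contents).
w = pd.DataFrame({"word": ["word03", np.nan, "word04", "", "word02",
                           "word01", np.nan, "word01", "word01", "word01"]})

# Allowed non-missing values; NaN is handled separately via isna().
allowed_strings = ["", "word01"]

# A row is acceptable if it is missing (NaN) or one of the allowed strings.
ok = w["word"].isna() | w["word"].isin(allowed_strings)

# True only if every value in the column is acceptable.
only_allowed = bool(ok.all())

# The distinct values that fall outside the allowed set.
unexpected = sorted(w.loc[~ok, "word"].unique())

print(only_allowed)
print(unexpected)
```

Here only_allowed answers the yes/no question directly, and unexpected lists the offending values in the same spirit as the set difference above.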

Upvotes: 3
