Weird behavior in str count with Pandas DataFrame

Question

I have the following Pandas DataFrame:

>>> sample_dataframe
        P
0  107.35
1   99.35
2   75.85
3   92.34

When I try the following, the output is as follows:

>>> sample_dataframe[sample_dataframe['P'].astype(str).str.count('.') == 1]

Empty DataFrame
Columns: [P]
Index: []

Whereas using the regex escaped character, the following occurs:

>>> sample_dataframe[sample_dataframe['P'].astype(str).str.count(r'\.') == 1]

        P
0  107.35
1   99.35
2   75.85
3   92.34

The following further reinforces this:

>>> sample_dataframe['P'].astype(str).str.count('.')

0    6
1    5
2    5
3    5
Name: P, dtype: int64

vs.

>>> sample_dataframe['P'].astype(str).str.count(r'\.')

0    1
1    1
2    1
3    1
Name: P, dtype: int64

Thus, the unescaped '.' is treated as the regex wildcard, which matches every character except a newline, hence the counts 6, 5, 5, 5. The escaped \. counts only occurrences of the literal period character.
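To check this reading outside of Pandas, here is a minimal sketch using only the standard re module. It assumes that counting regex matches per string (via re.findall) is a fair stand-in for what the Series method does; the list name values is just for illustration.

```python
import re

# String forms of the values in the 'P' column above.
values = ['107.35', '99.35', '75.85', '92.34']

# Unescaped '.' is the regex wildcard: one match per character.
wildcard_counts = [len(re.findall('.', v)) for v in values]

# Escaped dot matches only a literal period.
literal_counts = [len(re.findall(r'\.', v)) for v in values]

print(wildcard_counts)  # [6, 5, 5, 5]
print(literal_counts)   # [1, 1, 1, 1]
```

These match the two Pandas outputs above, which is consistent with the pattern being interpreted as a regex.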

However, the built-in count method called on a plain string behaves differently and does not need the '.' to be regex-escaped:

>>> '105.35'.count('.')
1

>>> '105.35'.count('\.')
0

EDIT: Based on some of the answers, I will try to clarify the class function calls below (whereas right above is the instantiated object's method call):

>>> str.count('105.35', '.')
1

>>> str.count('105.35', '\.')
0

I am not sure whether the Pandas methods, which use CPython under the hood (due to NumPy operations), implement this as a regex (including for df.apply), or whether this comes from a difference between the str class function count (i.e. str.count()) and the count method of an instantiated str object (i.e. '105.35'.count()). Is the underlying cause the difference between a class function and an object method (and how they are implemented), or is it how DataFrames are implemented via NumPy?

I would really like some more information on this to truly understand how it works.

Stepan · Accepted Answer

That's because pandas.Series.str.count and the built-in str.count are different methods. You can see here ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.count.html#pandas.Series.str.count ) that pandas.Series.str.count takes a regex as its argument, and the regex “.” means “any character”, whereas str.count counts occurrences of the provided substring (not a regex). The class-versus-instance distinction is not the cause: str.count('105.35', '.') and '105.35'.count('.') call the same method.
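
As a rough sketch of two ways to count literal periods in the column, assuming the DataFrame from the question: either escape the pattern with re.escape before passing it to Series.str.count, or apply the built-in str.count to each element with Series.map. The variable names as_text, regex_literal, regex_wildcard and builtin_literal are only illustrative.

```python
import re

import pandas as pd

# Rebuild the DataFrame from the question.
sample_dataframe = pd.DataFrame({'P': [107.35, 99.35, 75.85, 92.34]})
as_text = sample_dataframe['P'].astype(str)

# Series.str.count interprets its argument as a regex, so escape the dot
# to count literal periods.
regex_literal = as_text.str.count(re.escape('.'))

# Unescaped, '.' matches any character, giving the string lengths.
regex_wildcard = as_text.str.count('.')

# The built-in str.count does plain substring matching, no escaping needed.
builtin_literal = as_text.map(lambda s: s.count('.'))

print(regex_literal.tolist())    # [1, 1, 1, 1]
print(regex_wildcard.tolist())   # [6, 5, 5, 5]
print(builtin_literal.tolist())  # [1, 1, 1, 1]

# Class-level and instance-level calls are the same method.
print(str.count('105.35', '.') == '105.35'.count('.'))  # True
```

re.escape is handy when the substring comes from user input and may contain other regex metacharacters.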
