qxzsilver

Reputation: 655

Weird behavior in str count with Pandas DataFrame

I have the following Pandas DataFrame:

>>> sample_dataframe
        P
0  107.35
1   99.35
2   75.85
3   92.34

When I try the following, the output is as follows:

>>> sample_dataframe[sample_dataframe['P'].astype(str).str.count('.') == 1]

Empty DataFrame
Columns: [P]
Index: []

Whereas using the regex escaped character, the following occurs:

>>> sample_dataframe[sample_dataframe['P'].astype(str).str.count('\.') == 1]

        P
0  107.35
1   99.35
2   75.85
3   92.34

The following further reinforces this:

>>> sample_dataframe['P'].astype(str).str.count('.')

0    6
1    5
2    5
3    5
Name: P, dtype: int64

vs.

>>> sample_dataframe['P'].astype(str).str.count('\.')

0    1
1    1
2    1
3    1
Name: P, dtype: int64

Thus, the unescaped '.' is being treated as the regex wildcard, which matches any character except a newline, hence the counts 6, 5, 5, 5. The escaped '\.' counts only occurrences of the literal '.' character.
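The counts above can be reproduced with the standard library alone. This is a minimal sketch: it assumes that counting regex matches per string with re.findall mirrors what Series.str.count does, which is an assumption for illustration and not a claim about the pandas internals. The names values, wildcard_counts and literal_counts are made up for this example.

```python
import re

# The column values from the question, already converted to strings.
values = ['107.35', '99.35', '75.85', '92.34']

# Unescaped '.' is the regex wildcard, so every character is a match.
wildcard_counts = [len(re.findall('.', v)) for v in values]

# An escaped dot (raw string) matches only the literal '.' character.
literal_counts = [len(re.findall(r'\.', v)) for v in values]

print(wildcard_counts)  # [6, 5, 5, 5]
print(literal_counts)   # [1, 1, 1, 1]
```

Writing the pattern as a raw string (r'\.') avoids the invalid escape sequence warning that newer Python versions emit for '\.'.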

However, the regular function called from the string itself seems to be acting differently and doesn't need a regex escape of the '.':

>>> '105.35'.count('.')
1

>>> '105.35'.count('\.')
0

EDIT: Based on some of the answers, I will try to clarify the class function calls below (whereas right above is the instantiated object's method call):

>>> str.count('105.35', '.')
1

>>> str.count('105.35', '\.')
0

I am not sure whether the pandas string methods, which run on CPython and NumPy under the hood, implement this as a regex (including when called through df.apply), or whether this comes from a difference between calling count as a function on the str class (i.e. str.count()) and calling it as a method on an instance (i.e. '105.35'.count()). Is the underlying cause the difference between the class-level function and the instance method and how they are implemented, or is it caused by how DataFrames are implemented on top of NumPy?

I would really like some more information on this to truly understand how it works.

Upvotes: 1

Views: 352

Answers (2)

jezrael

Reputation: 863226

If you check Series.str.count, it treats its pattern as a regex, so it is necessary to escape the dot as \. to count literal '.' characters. Otherwise the unescaped . is the regex wildcard and counts every character.

If you want to see how the function is implemented in pandas, check the pandas source code.

str.count in pure Python works differently: it counts occurrences of a substring, not regex matches, so the output is different.

Upvotes: 0

Stepan

Reputation: 1054

That's because pandas.Series.str.count and the built-in str.count method are different. You can see here ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.count.html#pandas.Series.str.count ) that pandas.Series.str.count takes a regex as its argument. In a regex, “.” means “any character”, while str.count returns the number of occurrences of the given substring (not a regex).
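As a rough sketch of two ways to count a literal dot on the DataFrame from the question: either escape the pattern with re.escape before passing it to Series.str.count, or apply the built-in str.count to each element. The variable names as_text, regex_counts and substring_counts are made up for this example.

```python
import re

import pandas as pd

# Rebuild the DataFrame from the question.
sample_dataframe = pd.DataFrame({'P': [107.35, 99.35, 75.85, 92.34]})
as_text = sample_dataframe['P'].astype(str)

# re.escape('.') produces a pattern that matches only a literal dot,
# so the regex-based Series.str.count no longer sees a wildcard.
regex_counts = as_text.str.count(re.escape('.'))

# The built-in str.count does plain substring counting per element.
substring_counts = as_text.apply(lambda s: s.count('.'))

print(regex_counts.tolist())      # [1, 1, 1, 1]
print(substring_counts.tolist())  # [1, 1, 1, 1]
```

re.escape is handy when the substring comes from a variable and may contain other regex metacharacters.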

Upvotes: 1
