Reputation: 655
I have the following Pandas DataFrame:
>>> sample_dataframe
P
0 107.35
1 99.35
2 75.85
3 92.34
When I try the following, the output is as follows:
>>> sample_dataframe[sample_dataframe['P'].astype(str).str.count('.') == 1]
Empty DataFrame
Columns: [P]
Index: []
Whereas using the regex escaped character, the following occurs:
>>> sample_dataframe[sample_dataframe['P'].astype(str).str.count('\.') == 1]
P
0 107.35
1 99.35
2 75.85
3 92.34
The following further reinforces this:
>>> sample_dataframe['P'].astype(str).str.count('.')
0 6
1 5
2 5
3 5
Name: P, dtype: int64
vs.
>>> sample_dataframe['P'].astype(str).str.count('\.')
0 1
1 1
2 1
3 1
Name: P, dtype: int64
Thus, the '.' expression is being treated as the regex wildcard and counts every character except newlines, hence the counts 6, 5, 5, 5. The escaped '\.' counts only occurrences of the literal '.' character.
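A minimal sketch of that reasoning using only the standard `re` module. It assumes that counting regex matches per string is equivalent to what Series.str.count does, and the `values` list is simply the column above written out as strings:

```python
import re

# The DataFrame column from above, already converted to strings.
values = ["107.35", "99.35", "75.85", "92.34"]

# An unescaped '.' is the regex wildcard: it matches every character except a newline.
wildcard_counts = [len(re.findall(".", v)) for v in values]

# An escaped dot (written as a raw string) matches only a literal '.'.
literal_counts = [len(re.findall(r"\.", v)) for v in values]

print(wildcard_counts)  # [6, 5, 5, 5]
print(literal_counts)   # [1, 1, 1, 1]
```

The wildcard counts are just the string lengths, which matches the 6, 5, 5, 5 output seen earlier.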
However, the regular function called from the string itself seems to be acting differently and doesn't need a regex escape of the '.':
>>> '105.35'.count('.')
1
>>> '105.35'.count('\.')
0
EDIT: Based on some of the answers, I will try to clarify the class function calls below (whereas right above is the instantiated object's method call):
>>> str.count('105.35', '.')
1
>>> str.count('105.35', '\.')
0
I am not sure whether Pandas-related methods using CPython under the hood (due to NumPy operations) implement this as a regex (including for df.apply), or whether this is related to the difference between the str class function count (i.e. str.count()) and the count method of the instantiated str object (in the above example '105.35', i.e. '105.35'.count()). Is the difference between the class function and the object method (and how they are implemented) the underlying cause, or is this caused by how DataFrames are implemented via NumPy?
I would really like some more information on this to truly understand how this works.
Upvotes: 1
Views: 352
Reputation: 863226
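If you check Series.str.count, you will see that it treats its pattern as a regular expression, so you need to escape the dot as '\.' to count literal '.' characters. Otherwise the regex '.' matches, and therefore counts, every character.
If you want to see how the function is implemented in pandas, check the pandas source.
str.count in pure Python works differently: it counts substring occurrences rather than regex matches, so the output is different.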
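A short sketch of two ways to get the literal count in pandas, assuming the same column as in the question. `re.escape` and `Series.map` are swapped in here for illustration; they are not taken from the question:

```python
import re

import pandas as pd

sample_dataframe = pd.DataFrame({"P": [107.35, 99.35, 75.85, 92.34]})
as_text = sample_dataframe["P"].astype(str)

# Option 1: escape the pattern so the regex engine sees a literal dot.
regex_counts = as_text.str.count(re.escape("."))

# Option 2: skip the regex engine and call Python's substring-based
# str.count on each element.
substring_counts = as_text.map(lambda s: s.count("."))

print(regex_counts.tolist())      # [1, 1, 1, 1]
print(substring_counts.tolist())  # [1, 1, 1, 1]
```

`re.escape` is handy when the character to count comes from a variable, since it escapes any regex metacharacter without having to remember which ones are special.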
Upvotes: 0
Reputation: 1054
That's because pandas.Series.str.count and the built-in string count method are different. You can see here ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.count.html#pandas.Series.str.count ) that pandas.Series.str.count takes a regex as its argument. The regex “.” means “any character”, while str.count counts occurrences of the given substring (not a regex).
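On the class-function versus instance-method part of the question, a quick check in plain Python suggests the two spellings are the same operation, so that distinction is not what causes the difference. The variable names below are only illustrative:

```python
text = "105.35"

# Calling through the class with the instance passed explicitly...
via_class = str.count(text, ".")
# ...is the same operation as calling the bound method on the instance.
via_instance = text.count(".")

print(via_class, via_instance)  # 1 1

# A backslash followed by a dot is treated as a two-character substring,
# not as a regex, so it is not found in the text.
escaped_count = text.count(r"\.")
print(escaped_count)  # 0
```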
Upvotes: 1