Reputation: 510
Does rfind iterate over a string from end to start?
I read the docs https://docs.python.org/3.12/library/stdtypes.html#str.rfind and see
str.rfind(sub[, start[, end]])
Return the highest index in the string where substring sub is found, such that sub is contained within s[start:end]. Optional arguments start and end are interpreted as in slice notation. Return -1 on failure.
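To make sure I understand the documented contract, here are my own small examples (not from the docs):

```python
s = 'abcabc'

# Highest index where the substring occurs.
assert s.rfind('b') == 4

# start/end restrict the search to s[0:3] == 'abc'.
assert s.rfind('b', 0, 3) == 1

# -1 on failure (unlike str.rindex, which raises ValueError).
assert s.rfind('z') == -1
```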
The docs say little about the implementation. Maybe there are implementation notes somewhere else in the documentation.
I tried to look up the source code in my IDE (Visual Studio Code), and it showed me something much like an interface stub for hidden native (C) code.
def rfind(self, sub: str, start: SupportsIndex | None = ..., end: SupportsIndex | None = ..., /) -> int: ...
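Asking Python itself seems to confirm there is nothing Python-level behind the stub; inspect refuses to fetch source for C-implemented builtins:

```python
import inspect

# str.rfind is implemented in C, so there is no Python source to show;
# inspect.getsource raises TypeError for built-in method descriptors.
try:
    inspect.getsource(str.rfind)
except TypeError as e:
    print('no Python source:', e)
```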
I then tried to find the corresponding source code in the Python repositories on GitHub, but that turned out not to be so easy.
I am a newbie in Python, so while it may be obvious to everyone else how to look up the source code needed to answer this, it is not straightforward for me.
Upvotes: 4
Views: 93
Reputation: 363304
The sources are easier to navigate if you're familiar with some history of Python. Specifically, the type str was historically called unicode in CPython and is still called unicode in the C sources.
So, for string methods:
Include/unicodeobject.h
Objects/unicodeobject.c
Python str.rfind will eventually get to the C function unicode_rfind_impl. In the main branch you can find the auto-generated declarations at Objects/clinic/unicodeobject.c.h#L1074-L1116 and the impl at Objects/unicodeobject.c#L12721-L12731.
Note: these declarations generated by Argument Clinic are a relatively recent development from gh-117431: Adapt str.find and friends to use the METH_FASTCALL calling convention (Apr 2024), and they are not used in any officially released version yet. For the current 3.12.4 you should look in unicodeobject.c directly.
You'll note that unicode_rfind_impl calls any_find_slice, passing in the direction -1. This direction <= 0 is used to select a specific implementation depending on the width of the underlying unicode:
- asciilib_rfind_slice (when both strings are ASCII)
- ucs1lib_rfind_slice
- ucs2lib_rfind_slice
- ucs4lib_rfind_slice
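These widths come from PEP 393's flexible string representation, where each string is stored with 1, 2, or 4 bytes per character. You can observe the effect indirectly from Python via object sizes (the exact byte counts are implementation details; this is just a sanity check):

```python
import sys

ascii_s = 'a' * 1000           # 1 byte per character (ASCII/UCS-1)
ucs2_s = '\u03b1' * 1000       # 2 bytes per character (UCS-2 range)
ucs4_s = '\U0001d6fc' * 1000   # 4 bytes per character (UCS-4 range)

# Wider characters cost proportionally more storage.
print(sys.getsizeof(ascii_s), sys.getsizeof(ucs2_s), sys.getsizeof(ucs4_s))
```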
These calls end up in the stringlib routines (stringlib/find.h:rfind_slice -> stringlib/find.h:rfind -> stringlib/fastsearch.h:FASTSEARCH). Then, for the special case of 1-char substrings, we continue to stringlib/fastsearch.h:rfind_char and eventually end up here, where CPython seems to use either memrchr or reverse iteration, depending on the glibc version. For longer substrings we go to stringlib/fastsearch.h:default_rfind, implemented here, which looks like some variant of Boyer-Moore with a Bloom filter. An old effbot page describes an earlier version of this code as a "simplication (sic) of Boyer-Moore, incorporating ideas from Horspool and Sunday ... (with) ... a simple Bloom filter", but the implementation details may have shifted somewhat since then (2006).
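To give a feel for the technique, here is a hypothetical pure-Python sketch of a reverse Horspool-style search, with a plain character-membership set standing in for the Bloom filter. This is an illustration of the idea only, not CPython's actual code:

```python
def rfind_sketch(s: str, p: str) -> int:
    """Scan candidate windows from the right; when the character just
    left of the window cannot occur in the pattern at all, jump the
    window past it entirely instead of shifting by one."""
    n, m = len(s), len(p)
    if m == 0:
        return n
    if m > n:
        return -1
    chars = set(p)  # stand-in for the C code's Bloom filter
    i = n - m       # leftmost index of the current candidate window
    while i >= 0:
        if s[i:i + m] == p:
            return i
        if i > 0 and s[i - 1] not in chars:
            # No match window can contain position i-1: skip past it.
            i -= m + 1
        else:
            i -= 1
    return -1

print(rfind_sketch('abcabc', 'bc'))  # → 4, matching 'abcabc'.rfind('bc')
```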
Finally, you can use the stdlib timeit interactively to empirically verify that str.rfind does not meaningfully slow down when dealing with longer strings. Taken by itself, this does not guarantee reverse iteration, but it's certainly evidence that the implementation isn't a naive iteration from the start.
>>> s1 = 'x' * 1000 + 'y'
>>> s2 = 'x' * 1_000_000 + 'y'
>>> timeit s1.rfind('y')
68.7 ns ± 0.183 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> timeit s2.rfind('y')
68.9 ns ± 0.00835 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
Compare with putting the 'y' at the start, where we go from nanos to micros:
>>> s1 = 'y' + ('x' * 1000)
>>> s2 = 'y' + ('x' * 1_000_000)
>>> timeit s1.rfind('y')
73.6 ns ± 0.00866 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> timeit s2.rfind('y')
22.3 μs ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Upvotes: 6