onepiece

Reputation: 3529

Identity quirk with string split()

>>> 'hi'.split()[0] is 'hi'
True
>>> 'hi there'.split()[0] is 'hi'
False
>>> 'hi there again'.split()[0] is 'hi'
False

My hypothesis:

The first call produces only one element from split(), while the other two produce more than one. I believe that while Python primitives like str are stored in memory by value within a function, there are separate allocations across functions to simplify memory management. I think split() is one of those functions and usually allocates new strings, but it also handles the edge case of input that needs no splitting (such as 'hi') by simply returning a reference to the original string. Is my explanation correct?

Upvotes: 5

Views: 140

Answers (3)

user2357112

Reputation: 280973

"I believe that while Python primitives like str are stored in memory by value within a function, there will be separate allocations across functions to simplify memory management."

Python's object allocation doesn't work anything like that. There isn't a real concept of "primitives", and aside from a few things the bytecode compiler does to merge constants, it doesn't matter whether two objects are created in the same function or different functions.

There isn't really a better answer to this than to point to the source, so here it is:

Py_LOCAL_INLINE(PyObject *)
STRINGLIB(split_whitespace)(PyObject* str_obj,
                           const STRINGLIB_CHAR* str, Py_ssize_t str_len,
                           Py_ssize_t maxcount)
{
    ...
#ifndef STRINGLIB_MUTABLE
        if (j == 0 && i == str_len && STRINGLIB_CHECK_EXACT(str_obj)) {
            /* No whitespace in str_obj, so just use it as list[0] */
            Py_INCREF(str_obj);
            PyList_SET_ITEM(list, 0, (PyObject *)str_obj);
            count++;
            break;
        }

If it doesn't find any whitespace to split on, it just reuses the original string object in the returned list. It's just a quirk of how this function was written, and you can't depend on it working that way in other Python versions or nonstandard Python implementations.
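To see why this should not be relied on, here is a minimal sketch that separates the two questions: whether the first token equals the input, and whether it is the very same object. The helper name first_token_report is hypothetical, and the strings are built at run time so that the bytecode compiler's constant merging plays no part.

```python
def first_token_report(text):
    """Return (equal, identical) for the first token of text.split() versus text."""
    first = text.split()[0]
    return first == text, first is text

# Build the string at run time rather than writing a literal.
word = "".join(["h", "i"])

equal, identical = first_token_report(word)
print(equal)      # equality holds on any conforming implementation
print(identical)  # implementation detail: depends on the fast path quoted above

# With whitespace present, split() has to build new string objects.
sentence = word + " there"
first = sentence.split()[0]
print(first == word)  # the contents still match
print(first is word)  # typically False in CPython, but not something to depend on
```

Only the == results are guaranteed by the language; the is results simply report what a given interpreter happened to do.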

Upvotes: 1

gkusner

Reputation: 1244

As I said in a comment:

>>> 'hi there again'.split()[0] == 'hi'
True

Actually, your question title kind of nailed it: "is" checks identity (whether both sides are the same object), while "==" checks equality of value.
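A small sketch of the difference, assuming two strings with the same contents that are built separately at run time (the variable names are just for illustration):

```python
# Two strings with identical contents, each constructed at run time.
a = "".join(["h", "i"])
b = "".join(["h", "i"])

print(a == b)  # True: the characters match
print(a is a)  # True: an object is always identical to itself
print(a is b)  # usually False in CPython; the language makes no promise either way
```

For comparing string contents, == is the operator to use; is only answers whether two names refer to one and the same object.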

Upvotes: 0

dsh

Reputation: 12214

All data in Python is stored by reference (a PyObject* in the C implementation). What you discovered is that .split() simply returns the original string itself as an optimization when the delimiter is not found. When the delimiter is found, it must create a separate string object for each part, so the parts are distinct objects.

(This is unlike Java, which has distinct data types for "primitives" and "reference/class types" and treats them differently.)
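The "stored by reference" point is easiest to see with a mutable object. This is a minimal illustration with made-up variable names: assignment binds a second name to the same object, and only an explicit copy produces a new one.

```python
original = [1, 2]
alias = original           # a second name for the same list; nothing is copied
alias.append(3)

print(original)            # [1, 2, 3]: the change is visible through both names
print(alias is original)   # True: one object, two names

copied = list(original)    # an explicit copy is a distinct object
print(copied == original)  # True: same contents
print(copied is original)  # False: different objects
```

Strings behave the same way with respect to references; because they are immutable, the sharing is only observable through is or id().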

Upvotes: 0
