gcfitzgerald

Reputation: 1

Peculiar pandas 'is' vs '==' behaviour with functions referencing data frame elements

In writing a function that returns the exact (row, column) position of a known element in a data frame (is there an efficient built-in function already?), I came across the following strange behaviour. It is easiest to describe with an example.

Use the following data frame:

In [0] df = pd.DataFrame({'A': ['one', 'two', 'three'] , 'B': ['foo', 'bar', 'foo'], 'C':[1,2,3], 'D':[4,5,6]}, index = [0,1,2])

In [1] df

Out [1]:

       A    B  C  D
0    one  foo  1  4
1    two  bar  2  5
2  three  foo  3  6

My original function to return an exact (row, col) tuple used "is" because I wanted to be sure I was referring to the correct object, rather than to the first object in the data frame that held the same numeric value. For example, if I wanted the index of the number 4 in (0,'D'), I wanted to make sure I wasn't referencing a number 4 that happened to be in (0,'A'). My original data frame was all floats, but I've used the simplified one above, with strings and ints, along with a simplified function, to highlight the strange behaviour.

I created this function to return the element at a particular (row,col) location in the data frame.

In [2] def testr(datframe, row, col):
           return datframe[col][row]

Now using this function to test object reference equality (pointing to the same thing):

In [3] df.loc[0,'B'] is testr(df,0,'B')

Out [3] True

All good. However, trying a numeric entry:

In [4] df.loc[0,'C'] is testr(df,0,'C')

Out [4] False

This is confusing to me. I thought that my function was returning a reference to a particular element in the data frame and thus 'is' should return True, as in the case of a string element.

Something is going on behind the scenes with the return value of my function: when the element is numeric, what comes back appears to be a copy rather than the same object that is in the data frame. Note that substituting '==' for 'is' works fine for numeric elements (as one would expect).

Can anyone assist me in understanding more deeply what is happening here?

Many thanks.

Upvotes: 0

Views: 35

Answers (1)

juanpa.arrivillaga

Reputation: 96172

"I thought that my function was returning a reference to a particular element in the data frame and thus 'is' should return True, as in the case of a string element."

No. A new Python object is created each time you retrieve the item, because the value isn't stored as a Python object (as it would be with an object dtype). It is stored in a raw buffer of primitive 64-bit (or possibly 32-bit) integers. This is similar to autoboxing in languages that have primitive types as opposed to reference types (note that Python itself has no such distinction: everything is always an object). Your string column, by contrast, presumably has object dtype (check df.dtypes), so its buffer holds references to Python str objects and each retrieval hands back the very same object.

So, consider:

>>> import numpy as np
>>> import sys
>>> arr = np.array([1,2,3], dtype=np.int64)
>>> arr.nbytes
24
>>> arr.nbytes == 3*8
True
>>> e1 = arr[0]
>>> sys.getsizeof(e1) # not 64 bits (8 bytes), it's actually a big python object
32
>>> e2 = arr[0]
>>> e1
1
>>> type(e1)
<class 'numpy.int64'>
>>> e1 is e2
False
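
The same contrast can be reproduced in pandas itself. The following is only a minimal sketch: the dtypes are set explicitly (an assumption, so that the result does not depend on which string dtype your pandas version picks by default), and the names strings, ints, num and positions are made up for illustration. The last few lines show one possible way to locate a value by comparing with == instead of relying on identity, restricted to numeric columns.

```python
import numpy as np
import pandas as pd

# Explicit dtypes so the demo does not depend on pandas' default string dtype.
strings = pd.Series(["foo", "bar", "foo"], dtype=object)
ints = pd.Series([1, 2, 3], dtype="int64")

# object dtype: the buffer holds references to existing Python str objects,
# so two lookups return the very same object.
s1, s2 = strings[0], strings[0]
same_str = s1 is s2

# int64 dtype: the buffer holds raw 8-byte integers, so every lookup boxes
# the value into a fresh numpy.int64 object.
i1, i2 = ints[0], ints[0]
same_int = i1 is i2
equal_int = bool(i1 == i2)

print(same_str, same_int, equal_int)

# Locating a value by position rather than by identity (numeric columns only).
# `num` and `positions` are illustrative names, not part of the question.
num = pd.DataFrame({"C": [1, 2, 3], "D": [4, 5, 6]}, index=[0, 1, 2])
rows, cols = np.nonzero((num == 4).to_numpy())
positions = [(num.index[r], num.columns[c]) for r, c in zip(rows, cols)]
print(positions)
```

Since identity cannot tell two lookups of the same numeric cell apart from lookups of different cells, a (row, column) pair is the only reliable way to refer to one specific cell; the comparison above returns every position that holds the value, so duplicates show up as multiple tuples.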

Upvotes: 1
