Reputation: 125
I'm guessing this is a question only for folks who enjoy the challenge of digging into the source code behind Python tracebacks ... but maybe someone knows the answer off the top of their head.
This should be easy to reproduce with the code below (depending on your hardware and the value of sys.setrecursionlimit(), you might need to increase the maximum iteration count from my value of 2000).
It's numpy.genfromtxt reading a CSV file with 1 column and 1 row that consists of the single character 0. When "converters" is set explicitly with an in-line dictionary (the commented-out line below), all is well and fast. When "converters" is set indirectly through a variable as shown, Python does something recursive that seems totally unnecessary: the code gets slower and slower as the iteration count (and presumably the recursion depth) grows, then fails somewhere between 1400 and 1500 iterations (on my computer) with "RecursionError: maximum recursion depth exceeded". The traceback points at the source code involved, but I don't know how to dig into it.
The question is: why doesn't this code behave exactly like the version where "converters" is set explicitly? Is it a bug, or does it make sense and my code is bad?
# Spyder 3.3.3 | Python 3.7.3 64-bit | Qt 5.9.6 | PyQt5 5.9.2 | Windows 10
import numpy as np

the_converters = {'data': lambda s: 0}

jcount = 0
while jcount < 2000:
    jcount = jcount + 1
    print(jcount)
    the_array = np.genfromtxt('recursion_debug.csv', delimiter=',',
                              names='data',
                              converters=the_converters,
                              # converters={'data': lambda s: 0},  # this version works fine
                              )
Upvotes: 1
Views: 108
Reputation: 231325
In [1]: txt="""0,0
...: 0,0"""
In [14]: cvt = {'data':lambda s: 10}
In [15]: cvt
Out[15]: {'data': <function __main__.<lambda>(s)>}
In [16]: np.genfromtxt(txt.splitlines(),delimiter=',',usecols=[0],names='data',converters=cvt)
Out[16]: array([10, 10])
In [17]: cvt
Out[17]:
{'data': <function __main__.<lambda>(s)>,
0: functools.partial(<function genfromtxt.<locals>.tobytes_first at 0x7f5e71154bf8>, conv=<function <lambda> at 0x7f5e70928b70>)}
genfromtxt is modifying the cvt object in-place, and the effect is cumulative:
In [18]: np.genfromtxt(txt.splitlines(),delimiter=',',usecols=[0],names='data',converters=cvt)
Out[18]: array([10, 10])
In [19]: cvt
Out[19]:
{'data': <function __main__.<lambda>(s)>,
0: functools.partial(<function genfromtxt.<locals>.tobytes_first at 0x7f5e82ea4bf8>, conv=functools.partial(<function genfromtxt.<locals>.tobytes_first at 0x7f5e71154bf8>, conv=<function <lambda> at 0x7f5e70928b70>))}
Note that the value under the named key does not change; instead, genfromtxt adds a column-number key holding the wrapped converter.
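This cumulative re-wrapping is exactly what blows up in the question's loop: each call wraps the column-number entry in one more functools.partial layer, and evaluating the converter then has to unwind every layer. A minimal pure-Python sketch of the effect (fake_wrap is a hypothetical stand-in for genfromtxt's in-place update, not numpy code):

```python
import functools

def tobytes_first(x, conv):
    # same shape as the wrapper genfromtxt installs: re-encode str
    # input to bytes before calling the user's converter
    if type(x) is bytes:
        return conv(x)
    return conv(x.encode("latin1"))

cvt = {'data': lambda s: 10, 0: lambda s: 10}

def fake_wrap(converters):
    # hypothetical stand-in for genfromtxt's mutation: each call
    # wraps the column-0 converter in one more partial layer
    converters[0] = functools.partial(tobytes_first, conv=converters[0])

# simulate 300 calls against the same converters dict
for _ in range(300):
    fake_wrap(cvt)

# the converter still works, but one call now unwinds 300 nested
# tobytes_first frames; at roughly 1400-1500 layers this exceeds the
# default recursion limit, matching the failure in the question
print(cvt[0]('0'))  # → 10
```

Each added layer costs one extra stack frame per conversion, which also explains why the question's loop slows down before it finally dies.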
If instead we create the dictionary in-line and keep a reference only to the lambda (or function) itself, the function object is not modified:
In [26]: cvt = lambda s: 10
In [27]: np.genfromtxt(txt.splitlines(),delimiter=',',usecols=[0],names='data',converters={'data':cvt})
Out[27]: array([10, 10])
In [28]: cvt
Out[28]: <function __main__.<lambda>(s)>
Now make a function that shows the input string as well:
In [53]: def foo(s):
...: print(s)
...: return '10'
...:
In [54]: cvt = {'data':foo}
If I specify the encoding, the dictionary is still modified (a new key is added), but the function isn't wrapped:
In [55]: np.genfromtxt(txt.splitlines(),delimiter=',',usecols=[0],names='data',converters=cvt, encoding=None)
0
0
0
Out[55]: array(['10', '10'], dtype='<U2')
In [56]: cvt
Out[56]: {'data': <function __main__.foo(s)>, 0: <function __main__.foo(s)>}
Without encoding (or with the default 'bytes'), the tobytes_first wrapper is added, and a bytestring is passed to my function:
In [57]: np.genfromtxt(txt.splitlines(),delimiter=',',usecols=[0],names='data',converters=cvt)
b'0'
b'0'
b'0'
b'0'
Out[57]: array(['10', '10'], dtype='<U2')
In [58]: cvt
Out[58]:
{'data': <function __main__.foo(s)>,
0: functools.partial(<function genfromtxt.<locals>.tobytes_first at 0x7f5e82e9c730>, conv=<function foo at 0x7f5e7113e268>)}
===
The code that adds the functools.partial is part of the old Py2-to-Py3 bytes-to-unicode transition in genfromtxt:
elif byte_converters:
    # converters may use decode to workaround numpy's old behaviour,
    # so encode the string again before passing to the user converter
    def tobytes_first(x, conv):
        if type(x) is bytes:
            return conv(x)
        return conv(x.encode("latin1"))
    import functools
    user_conv = functools.partial(tobytes_first, conv=conv)
else:
    user_conv = conv
Upvotes: 1