yapphdorlw

Reputation: 53

Creating a really huge scipy array

I want to create a scipy array from a really huge list, but unfortunately I stumbled across a problem.

I have a list xs of strings, each of length 1.

>>> type(xs)
<type 'list'>
>>> len(xs)
4001844816

If I convert only the first 10 elements, everything still works as expected.

>>> s = xs[0:10]
>>> x = scipy.array(s)
>>> x
array(['A', 'B', 'C', 'D', 'E', 'F', 'O', 'O'],
      dtype='|S1')
>>> len(x)
10

For the whole list I get this result:

>>> ary = scipy.array(xs)
>>> ary.size
1
>>> ary.shape
()
>>> ary[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: 0-d arrays can't be indexed
>>> ary[()]
...The long list

A workaround would be:

test = scipy.zeros(len(xs), dtype=(str, 1))
for i in xrange(len(xs)):
    test[i] = xs[i]

It is not a problem of insufficient memory. For now I will use the workaround (which takes 15 minutes), but I would like to understand the problem.

Thank you

-- Edit: A remark on the workaround: test[:] = xs will not work (it also fails with the 0-d IndexError).
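
A chunked variant of the workaround might be faster than the element-by-element loop: since converting short slices works fine (see the 10-element example above), copying the list slice by slice should avoid the failing whole-list conversion. This is only a sketch, and the chunk size of 10**7 is an arbitrary choice that I have not timed:

chunk = 10 ** 7  # arbitrary, just has to stay comfortably small
test = scipy.zeros(len(xs), dtype=(str, 1))
for start in xrange(0, len(xs), chunk):
    # each slice is converted on its own, so the whole-list conversion is never triggered
    test[start:start + chunk] = xs[start:start + chunk]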

On my MacBook, 2147483648 (that is, 2**31) was the smallest size that caused the problem. I determined it with this small script:

#!/usr/bin/python
import scipy as sp

startlen = 2147844816  # a little above 2**31 = 2147483648

xs = ["A"] * startlen
ary = sp.array(xs)
while ary.shape == ():  # shrink the list until the conversion succeeds
    print "bad", len(xs)
    xs.pop()
    ary = sp.array(xs)

print "good", len(xs)
print ary.shape, ary[0:10]
print "DONE."

This was the output:

...
bad 2147483649
bad 2147483648
good 2147483647
(2147483647,) ['A' 'A' 'A' 'A' 'A' 'A' 'A' 'A' 'A' 'A']
DONE.

The Python version is:

>>> sys.version
'2.7.5 (default, Aug 25 2013, 00:04:04) \n[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]'
>>> scipy.version.version
'0.11.0'

Upvotes: 4

Views: 190

Answers (1)

Paul

Reputation: 7325

Assuming you have a 64-bit OS/Python/NumPy, you might be seeing some manifestation of an out-of-memory condition, which can show up in unusual ways. Your first list is 4 GB, and then you allocated an additional 4 GB for the numpy array. Even for x64 those are big arrays. Have you seen memmap arrays before?
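
As a quick sanity check (just a small sketch, not required for the memmap approach), you can confirm that your Python and numpy are actually 64-bit builds:

import sys
import numpy

print(sys.maxsize)                       # 9223372036854775807 on a 64-bit Python
print(numpy.dtype(numpy.intp).itemsize)  # 8 (bytes) on a 64-bit numpy build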

What I have done below is create a series of memmap arrays to show where (for my machine) the breaking points are (primarily disk I/O). Still, decent array sizes can be created: 30 billion 'S1' elements worked. This code might help you see whether a memmap array can provide some benefit for your problem; they are easy to work with. Your 15-minute workaround could also be sped up using memmap arrays (a rough sketch follows the output below).

import numpy
from numpy import arange

baseNumber = 3000000L
# dataType = 'float64'
numBytes = 1
dataType = 'S1'
for powers in arange(1, 7):
    l1 = baseNumber * 10 ** powers
    print('working with %d elements' % (l1))
    print('number bytes required %f GB' % (l1 * numBytes / 1e9))
    try:
        fp = numpy.memmap('testa.map', dtype=dataType, mode='w+', shape=(1, l1))
        # creation succeeded
        print('works')
        del fp
    except Exception as e:
        print(repr(e))


"""
working with 30000000 elements
number bytes required 0.030000 GB
works
working with 300000000 elements
number bytes required 0.300000 GB
works
working with 3000000000 elements
number bytes required 3.000000 GB
works
working with 30000000000 elements
number bytes required 30.000000 GB
works
working with 300000000000 elements
number bytes required 300.000000 GB
IOError(28, 'No space left on device')
working with 3000000000000 elements
number bytes required 3000.000000 GB
IOError(28, 'No space left on device')


"""

Upvotes: 1
