Reputation: 1111
Suppose I have this numpy array:
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]]
and I want to split it into 2 batches and then iterate over them:
[[1, 2, 3],    # Batch 1
[4, 5, 6]]
[[7, 8, 9],    # Batch 2
[10, 11, 12]]
What is the simplest way to do it?
EDIT: I'm sorry I left out this information: once I start iterating, I don't want the original array to be destroyed by the splitting and batching. When the batch iteration finishes, I need to restart again from the first batch, so the original array must be preserved. The whole idea is to be consistent with Stochastic Gradient Descent algorithms, which require iterating over batches. In a typical example, I could have a 100000-iteration for loop over just 1000 batches that must be replayed again and again.
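To make this concrete, the usage pattern I have in mind is roughly the following (the np.array_split call is only a placeholder for whatever splitting you suggest, and n_iterations is a made-up name):
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]])

batches = np.array_split(arr, 2)   # placeholder for the actual splitting method

n_iterations = 100000
for it in range(n_iterations):
    for batch in batches:          # replay the same batches on every pass
        pass                       # use `batch` here; `arr` must stay intact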
Upvotes: 16
Views: 20559
Reputation: 2167
Improving on the previous answer, to split based on batch size you can use:
def split_by_batchsize(arr, batch_size):
    return np.array_split(arr, int(arr.shape[0] / batch_size) + 1)
or with extra safety:
def split_by_batch_size(arr, batch_size):
    nbatches = arr.shape[0] // batch_size
    if nbatches != arr.shape[0] / batch_size:
        nbatches += 1
    return np.array_split(arr, nbatches)
example:
import numpy as np
nrows = 17
batch_size = 2
split_by_batchsize(np.random.random((nrows, 2)), batch_size)
# [array([[0.60482079, 0.81391257],
# [0.00175093, 0.25126441]]),
# array([[0.48591974, 0.77793401],
# [0.72128946, 0.3606879 ]]),
# array([[0.95649328, 0.24765806],
# [0.78844782, 0.56304567]]),
# array([[0.07310456, 0.76940976],
# [0.92163079, 0.90803845]]),
# array([[0.77838703, 0.98460593],
# [0.88397437, 0.39227769]]),
# array([[0.87599421, 0.7038426 ],
# [0.19780976, 0.12763436]]),
# array([[0.14263759, 0.9182901 ],
# [0.40523958, 0.0716843 ]]),
# array([[0.9802908 , 0.01067808],
# [0.53095143, 0.74797636]]),
# array([[0.7596607 , 0.97923229]])]
Sadly, simple iteration is faster than this fancier method, so I do not suggest using this approach.
batch_size = 3
nrows = 1000
arr = np.random.random((nrows, 2))
%%timeit
for i in range((arr.shape[0] // batch_size) + 1):
    idx = i * batch_size
    foo = arr[idx:idx+batch_size, :]
# 345 µs ± 119 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
for foo in split_by_batch_size(arr, batch_size):
    pass
# 1.84 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The speed difference seems to come from np.array_split creating the list of arrays up front.
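If you want lazy iteration without building the whole list up front, a plain generator over slices is a possible middle ground (just a sketch, iter_batches is a made-up name and not benchmarked here):
def iter_batches(arr, batch_size):
    # yields consecutive row slices (views, no copies); the last one may be shorter
    for start in range(0, arr.shape[0], batch_size):
        yield arr[start:start + batch_size]

for foo in iter_batches(arr, batch_size):
    pass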
Upvotes: 2
Reputation: 891
To avoid the error "array split does not result in an equal division", np.array_split(arr, n, axis=0) is better than np.split(arr, n, axis=0).
For example,
a = np.array([[170, 52, 204],
[114, 235, 191],
[ 63, 145, 171],
[ 16, 97, 173]])
then
print(np.array_split(a, 2))
[array([[170, 52, 204],
[114, 235, 191]]), array([[ 63, 145, 171],
[ 16, 97, 173]])]
print(np.array_split(a, 3))
[array([[170, 52, 204],
[114, 235, 191]]), array([[ 63, 145, 171]]), array([[ 16, 97, 173]])]
However, print(np.split(a, 3)) will raise an error, since 4/3 is not an integer.
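If you still want to try np.split first, the uneven case can be caught explicitly; a small sketch reusing the array a from above:
import numpy as np

try:
    parts = np.split(a, 3)          # fails: 4 rows cannot be split into 3 equal parts
except ValueError:
    parts = np.array_split(a, 3)    # falls back to an uneven split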
Upvotes: 7
Reputation: 294488
consider array a
a = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]])
Option 1
use reshape and //
a.reshape(a.shape[0] // 2, -1, a.shape[1])
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
Option 2
if you wanted groups of two rather than two groups
a.reshape(-1, 2, a.shape[1])
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
Option 3
Use a generator
def get_every_n(a, n=2):
    for i in range(a.shape[0] // n):
        yield a[n*i:n*(i+1)]

for sa in get_every_n(a, n=2):
    print(sa)
[[1 2 3]
[4 5 6]]
[[ 7 8 9]
[10 11 12]]
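Note that get_every_n drops the trailing rows when a.shape[0] is not a multiple of n (it only yields a.shape[0] // n full batches). If the partial last batch is needed, a drop-in variant can step with a stride instead (a sketch):
def get_every_n(a, n=2):
    # yields slices of n rows; the final slice may have fewer rows
    for i in range(0, a.shape[0], n):
        yield a[i:i + n]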
Upvotes: 17
Reputation: 1668
This is what I have used to iterate through batches. I use the b.next() method to generate the indices, then pass the output to slice a numpy array, for example a[b.next()], where a is a numpy array.
class Batch():
    def __init__(self, total, batch_size):
        self.total = total
        self.batch_size = batch_size
        self.current = 0

    def next(self):
        max_index = self.current + self.batch_size
        indices = [i if i < self.total else i - self.total
                   for i in range(self.current, max_index)]
        self.current = max_index % self.total
        return indices
b = Batch(10, 3)
print(b.next()) # [0, 1, 2]
print(b.next()) # [3, 4, 5]
print(b.next()) # [6, 7, 8]
print(b.next()) # [9, 0, 1]
print(b.next()) # [2, 3, 4]
print(b.next()) # [5, 6, 7]
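For completeness, a quick sketch of the slicing usage described above (the array a and the loop count are made up for illustration):
import numpy as np

a = np.arange(20).reshape(10, 2)    # any array with 10 rows
b = Batch(a.shape[0], 3)

for _ in range(4):                  # replay as often as needed
    batch = a[b.next()]             # fancy indexing copies rows; `a` stays intact
    print(batch.shape)              # (3, 2) each time, wrapping past the end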
Upvotes: 0
Reputation: 221614
You can use numpy.split to split along the first axis into n parts, where n is the number of desired batches. Thus, the implementation would look like this -
np.split(arr,n,axis=0) # n is number of batches
Since the default value for axis is 0, we can skip setting it. So, we would simply have -
np.split(arr,n)
Sample runs -
In [132]: arr # Input array of shape (10,3)
Out[132]:
array([[170, 52, 204],
[114, 235, 191],
[ 63, 145, 171],
[ 16, 97, 173],
[197, 36, 246],
[218, 75, 68],
[223, 198, 84],
[206, 211, 151],
[187, 132, 18],
[121, 212, 140]])
In [133]: np.split(arr,2) # Split into 2 batches
Out[133]:
[array([[170, 52, 204],
[114, 235, 191],
[ 63, 145, 171],
[ 16, 97, 173],
[197, 36, 246]]), array([[218, 75, 68],
[223, 198, 84],
[206, 211, 151],
[187, 132, 18],
[121, 212, 140]])]
In [134]: np.split(arr,5) # Split into 5 batches
Out[134]:
[array([[170, 52, 204],
[114, 235, 191]]), array([[ 63, 145, 171],
[ 16, 97, 173]]), array([[197, 36, 246],
[218, 75, 68]]), array([[223, 198, 84],
[206, 211, 151]]), array([[187, 132, 18],
[121, 212, 140]])]
Upvotes: 19
Reputation: 431
Do it like this (plain slicing works the same on a NumPy array):
a = [[1, 2, 3], [4, 5, 6],
     [7, 8, 9], [10, 11, 12]]
b = a[0:2]
c = a[2:4]
Upvotes: -2