Reputation: 2949
I have an array ar = [2,2,2,1,1,2,2,3,3,3,3]
.
For this array, I want to find the lengths of consecutive same numbers like:
values: 2, 1, 2, 3
lengths: 3, 2, 2, 4
In R
, this is obtained by using rle()
function. Is there any existing function in python which provides required output?
Upvotes: 13
Views: 4059
Reputation: 16081
You can do this with groupby
from itertools import groupby
ar = [2,2,2,1,1,2,2,3,3,3,3]
print([(k, sum(1 for i in g)) for k,g in groupby(ar)])
# [(2, 3), (1, 2), (2, 2), (3, 4)]
Upvotes: 19
Reputation: 2846
Here is an optimized answer using numpy arrays which runs quickly if the run lengths are long.
In this case I want to encode an array of uint16 that can be much
larger than 2**16
using 16 bit unsigned integer run length encoding.
To allow this the array is "chunked" so the indices never exceed
2**16
:
import numpy as np
def run_length_encode(array, chunksize=((1 << 16) - 1), dtype=np.int16):
"Chunked run length encoding for very large arrays containing smallish values."
shape = array.shape
ravelled = array.ravel()
length = len(ravelled)
chunk_cursor = 0
runlength_chunks = []
while chunk_cursor < length:
chunk_end = chunk_cursor + chunksize
chunk = ravelled[chunk_cursor : chunk_end]
chunk_length = len(chunk)
change = (chunk[:-1] != chunk[1:])
change_indices = np.nonzero(change)[0]
nchanges = len(change_indices)
cursor = 0
runlengths = np.zeros((nchanges + 1, 2), dtype=dtype)
for (count, index) in enumerate(change_indices):
next_cursor = index + 1
runlengths[count, 0] = chunk[cursor] # value
runlengths[count, 1] = next_cursor - cursor # run length
cursor = next_cursor
# last run
runlengths[nchanges, 0] = chunk[cursor]
runlengths[nchanges, 1] = chunk_length - cursor
runlength_chunks.append(runlengths)
chunk_cursor = chunk_end
all_runlengths = np.vstack(runlength_chunks).astype(dtype)
description = dict(
shape=shape,
runlengths=all_runlengths,
dtype=dtype,
)
return description
def run_length_decode(description):
dtype = description["dtype"]
runlengths = description["runlengths"]
shape = description["shape"]
array = np.zeros(shape, dtype=dtype)
ravelled = array.ravel()
cursor = 0
for (value, size) in runlengths:
run_end = cursor + size
ravelled[cursor : run_end] = value
cursor = run_end
array = ravelled.reshape(shape) # redundant?
return array
def testing():
A = np.zeros((50,), dtype=np.uint16)
A[20:30] = 10
A[30:35] = 6
A[40:] = 3
test = run_length_encode(A, chunksize=17)
B = run_length_decode(test)
assert np.alltrue(A == B)
print ("ok!")
if __name__=="__main__":
testing()
I built this for a project having to do with classifying microscopy images of mouse embryos.
https://github.com/flatironinstitute/mouse_embryo_labeller
Note: I edited the entry after I found I had to caste the type in this line to it to work for large arrays:
all_runlengths = np.vstack(runlength_chunks).astype(dtype)
Upvotes: 0
Reputation: 31908
Here is an answer using the high-performance pyrle library for run length arithmetic:
# pip install pyrle
# (pyrle >= 0.0.25)
from pyrle import Rle
v = [2,2,2,1,1,2,2,3,3,3,3]
r = Rle(v)
print(r)
# +--------+-----+-----+-----+-----+
# | Runs | 3 | 2 | 2 | 4 |
# |--------+-----+-----+-----+-----|
# | Values | 2 | 1 | 2 | 3 |
# +--------+-----+-----+-----+-----+
# Rle of length 11 containing 4 elements
print(r[4])
# 1.0
print(r[4:7])
# +--------+-----+-----+
# | Runs | 1 | 2 |
# |--------+-----+-----|
# | Values | 1.0 | 2.0 |
# +--------+-----+-----+
# Rle of length 3 containing 2 elements
r + r + 0.5
# +--------+-----+-----+-----+-----+
# | Runs | 3 | 2 | 2 | 4 |
# |--------+-----+-----+-----+-----|
# | Values | 4.5 | 2.5 | 4.5 | 6.5 |
# +--------+-----+-----+-----+-----+
# Rle of length 11 containing 4 elements
Upvotes: 3
Reputation: 31908
Here is an answer for pure numpy:
import numpy as np
def find_runs(x):
"""Find runs of consecutive items in an array."""
# ensure array
x = np.asanyarray(x)
if x.ndim != 1:
raise ValueError('only 1D array supported')
n = x.shape[0]
# handle empty array
if n == 0:
return np.array([]), np.array([]), np.array([])
else:
# find run starts
loc_run_start = np.empty(n, dtype=bool)
loc_run_start[0] = True
np.not_equal(x[:-1], x[1:], out=loc_run_start[1:])
run_starts = np.nonzero(loc_run_start)[0]
# find run values
run_values = x[loc_run_start]
# find run lengths
run_lengths = np.diff(np.append(run_starts, n))
return run_values, run_starts, run_lengths
Credit goes to https://github.com/alimanfoo
Upvotes: 3