Haroon Lone
Haroon Lone

Reputation: 2949

How to find Run length encoding in python

I have an array ar = [2,2,2,1,1,2,2,3,3,3,3]. For this array, I want to find the lengths of consecutive same numbers like:

 values: 2, 1, 2, 3
lengths: 3, 2, 2, 4

In R, this is obtained by using rle() function. Is there any existing function in python which provides required output?

Upvotes: 13

Views: 4059

Answers (4)

Rahul K P
Rahul K P

Reputation: 16081

You can do this with groupby

from itertools import groupby
ar = [2,2,2,1,1,2,2,3,3,3,3]
print([(k, sum(1 for i in g)) for k,g in groupby(ar)])
# [(2, 3), (1, 2), (2, 2), (3, 4)]

Upvotes: 19

Aaron Watters
Aaron Watters

Reputation: 2846

Here is an optimized answer using numpy arrays which runs quickly if the run lengths are long.

In this case I want to encode an array of uint16 that can be much larger than 2**16 using 16 bit unsigned integer run length encoding. To allow this the array is "chunked" so the indices never exceed 2**16:

import numpy as np

def run_length_encode(array, chunksize=((1 << 16) - 1), dtype=np.int16):
    "Chunked run length encoding for very large arrays containing smallish values."
    shape = array.shape
    ravelled = array.ravel()
    length = len(ravelled)
    chunk_cursor = 0
    runlength_chunks = []
    while chunk_cursor < length:
        chunk_end = chunk_cursor + chunksize
        chunk = ravelled[chunk_cursor : chunk_end]
        chunk_length = len(chunk)
        change = (chunk[:-1] != chunk[1:])
        change_indices = np.nonzero(change)[0]
        nchanges = len(change_indices)
        cursor = 0
        runlengths = np.zeros((nchanges + 1, 2), dtype=dtype)
        for (count, index) in enumerate(change_indices):
            next_cursor = index + 1
            runlengths[count, 0] = chunk[cursor] # value
            runlengths[count, 1] = next_cursor - cursor # run length
            cursor = next_cursor
        # last run
        runlengths[nchanges, 0] = chunk[cursor]
        runlengths[nchanges, 1] = chunk_length - cursor
        runlength_chunks.append(runlengths)
        chunk_cursor = chunk_end
    all_runlengths = np.vstack(runlength_chunks).astype(dtype)
    description = dict(
        shape=shape,
        runlengths=all_runlengths,
        dtype=dtype,
        )
    return description

def run_length_decode(description):
    dtype = description["dtype"]
    runlengths = description["runlengths"]
    shape = description["shape"]
    array = np.zeros(shape, dtype=dtype)
    ravelled = array.ravel()
    cursor = 0
    for (value, size) in runlengths:
        run_end = cursor + size
        ravelled[cursor : run_end] = value
        cursor = run_end
    array = ravelled.reshape(shape)  # redundant?
    return array

def testing():
    A = np.zeros((50,), dtype=np.uint16)
    A[20:30] = 10
    A[30:35] = 6
    A[40:] = 3
    test = run_length_encode(A, chunksize=17)
    B = run_length_decode(test)
    assert np.alltrue(A == B)
    print ("ok!")

if __name__=="__main__":
    testing()

I built this for a project having to do with classifying microscopy images of mouse embryos.

https://github.com/flatironinstitute/mouse_embryo_labeller

Note: I edited the entry after I found I had to caste the type in this line to it to work for large arrays:

all_runlengths = np.vstack(runlength_chunks).astype(dtype)

Upvotes: 0

The Unfun Cat
The Unfun Cat

Reputation: 31908

Here is an answer using the high-performance pyrle library for run length arithmetic:

# pip install pyrle
# (pyrle >= 0.0.25)

from pyrle import Rle

v = [2,2,2,1,1,2,2,3,3,3,3]

r = Rle(v)
print(r)
# +--------+-----+-----+-----+-----+
# | Runs   | 3   | 2   | 2   | 4   |
# |--------+-----+-----+-----+-----|
# | Values | 2   | 1   | 2   | 3   |
# +--------+-----+-----+-----+-----+
# Rle of length 11 containing 4 elements

print(r[4])
# 1.0

print(r[4:7])
# +--------+-----+-----+
# | Runs   | 1   | 2   |
# |--------+-----+-----|
# | Values | 1.0 | 2.0 |
# +--------+-----+-----+
# Rle of length 3 containing 2 elements

r + r + 0.5
# +--------+-----+-----+-----+-----+
# | Runs   | 3   | 2   | 2   | 4   |
# |--------+-----+-----+-----+-----|
# | Values | 4.5 | 2.5 | 4.5 | 6.5 |
# +--------+-----+-----+-----+-----+
# Rle of length 11 containing 4 elements

Upvotes: 3

The Unfun Cat
The Unfun Cat

Reputation: 31908

Here is an answer for pure numpy:

import numpy as np


def find_runs(x):
    """Find runs of consecutive items in an array."""

    # ensure array
    x = np.asanyarray(x)
    if x.ndim != 1:
        raise ValueError('only 1D array supported')
    n = x.shape[0]

    # handle empty array
    if n == 0:
        return np.array([]), np.array([]), np.array([])

    else:
        # find run starts
        loc_run_start = np.empty(n, dtype=bool)
        loc_run_start[0] = True
        np.not_equal(x[:-1], x[1:], out=loc_run_start[1:])
        run_starts = np.nonzero(loc_run_start)[0]

        # find run values
        run_values = x[loc_run_start]

        # find run lengths
        run_lengths = np.diff(np.append(run_starts, n))

        return run_values, run_starts, run_lengths

Credit goes to https://github.com/alimanfoo

Upvotes: 3

Related Questions