Seirra

Reputation: 131

Replace 1's with 0's in a sequence

I have a huge list of 1's and 0's like this:

x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]

Full list here.

I want to create a new list y with the condition that the 1's are preserved only if they occur in a run of length >= 10; otherwise those 1's should be replaced by 0's.
For example, based on x above, y should become:

y = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]

So far I have the following:

  1. Finding out where the changes are occurring, and
  2. Finding out which runs occur with what lengths:

import numpy as np
import itertools

nx = np.array(x)
# Positions after which the value changes:
print(np.argwhere(np.diff(nx)).squeeze())

# Run-length encode the sequence as (value, run_length) pairs:
answer = []
for key, group in itertools.groupby(nx):
    answer.append((key, len(list(group))))
print(answer)

which gives me :

[0 3 8 14]  # A
[(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)] # B

#A means the changes are happening after positions 0, 3, 8 and 14.

#B means there is one 1, followed by three 0's, then five 1's, then six 0's, then ten 1's.
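
As a quick sanity check, the (value, length) pairs in answer decode back to x:

decoded = [v for v, n in answer for _ in range(n)]
assert decoded == x  # the encoding round-trips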

How do I proceed to the final step of creating y, where the 1's are replaced with 0's depending on the run length?

PS: I'm humbled by these brilliant solutions from all the wonderful people.

Upvotes: 5

Views: 222

Answers (6)

Kasravnd

Reputation: 107347

If you want to use NumPy, here is one vectorized approach:

d = np.where(np.diff(x) != 0)[0]
ind = np.where(np.diff(np.concatenate(([0], d, [x.size]))) >= 10)[0] - 1
x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0  # vrange is defined in the demo below

If you want to use pure Python, here is an approach using itertools.chain, itertools.repeat and itertools.groupby within a generator expression:

from itertools import chain, groupby, repeat

chain.from_iterable(repeat(0, len(i)) if len(i) >= 10 else i for i in [list(g) for _, g in groupby(x)])

Demos:

# Python

In [28]: list(chain.from_iterable(repeat(0, len(i)) if len(i) >= 10 else i for i in [list(g) for _, g in groupby(x)]))
Out[28]: [1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Numpy

In [161]: x = np.array([1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1, 0, 0, 1, 1, 1, 1, 1, 1 ,1, 1, 1, 1, 0, 0])

In [162]: d = np.where(np.diff(x) != 0)[0]

In [163]: d2 = np.diff(np.concatenate(([0], d, [x.size])))

In [164]: ind = np.where(d2 >= 10)[0] - 1

In [165]: def vrange(starts, stops):
     ...:     stops = np.asarray(stops)
     ...:     l = stops - starts # Lengths of each range.
     ...:     return np.repeat(stops - l.cumsum(), l) + np.arange(l.sum())
     ...: 

In [166]: x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0

In [167]: x
Out[167]: 
array([1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

For vrange I used this answer: https://codereview.stackexchange.com/questions/83018/vectorized-numpy-version-of-arange-with-multiple-start-stop, but there may be more optimized approaches for that.
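
For intuition, here is a tiny vrange example (my own illustration): it builds the concatenation of arange(start, stop) for each pair without a Python-level loop:

In [168]: vrange(np.array([1, 6]), np.array([4, 9]))
Out[168]: array([1, 2, 3, 6, 7, 8])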

Upvotes: 2

Paul Panzer

Reputation: 53089

Here is another numpy approach. Please take note of the benchmarks at the bottom of this post:

import numpy as np
import pandas as pd
from itertools import groupby
import re
from timeit import timeit

def f_pp(data):
    switches = np.empty((data.size + 1,), bool)
    switches[0] = data[0]
    switches[-1] = data[-1]
    switches[1:-1] = data[:-1]^data[1:]
    switches = np.where(switches)[0].reshape(-1, 2)
    switches = switches[switches[:, 1]-switches[:, 0] >= 10].ravel()
    reps = np.empty((switches.size + 1,), int)
    reps[1:-1] = np.diff(switches)
    reps[0] = switches[0]
    reps[-1] = data.size - switches[-1]
    return np.repeat(np.arange(reps.size) & 1, reps)

def f_ja(data):
    result = []
    for k, g in groupby(data):
        if k:
            g = list(g)
            if len(g) < 10:
                g = len(g)*[0]
        result.extend(g)
    return result

def f_mu(s):
    s = s.copy()
    s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
    return s

def vrange(starts, stops):
     stops = np.asarray(stops)
     l = stops - starts # Lengths of each range.
     return np.repeat(stops - l.cumsum(), l) + np.arange(l.sum())

def f_ka(data):
    x = data.copy()
    d = np.where(np.diff(x) != 0)[0]
    d2 = np.diff(np.concatenate(([0], d, [x.size])))
    ind = np.where(d2 >= 10)[0] - 1
    x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0
    return x

def f_ol(data):
    return list(re.sub(b'(?<!\x01)\x01{,9}(?!\x01)', lambda m: len(m.group()) * b'\x00', bytes(data)))

n = 10_000
data = np.repeat((np.arange(n) + np.random.randint(2))&1, np.random.randint(1, 20, (n,)))
datal = data.tolist()
datap = pd.Series(data)

kwds = dict(globals=globals(), number=100)

print(np.where(f_ja(datal) != f_pp(data))[0])
print(np.where(f_ol(datal) != f_pp(data))[0])
#print(np.where(f_ka(data) != f_pp(data))[0])
print(np.where(f_mu(datap).values != f_pp(data))[0])

print('itertools.groupby: {:6.3f} ms'.format(10 * timeit('f_ja(datal)', **kwds)))
print('re:                {:6.3f} ms'.format(10 * timeit('f_ol(datal)', **kwds)))
#print('numpy Kasramvd:    {:6.3f} ms'.format(10 * timeit('f_ka(data)', **kwds)))
print('pandas:            {:6.3f} ms'.format(10 * timeit('f_mu(datap)', **kwds)))
print('numpy pp:          {:6.3f} ms'.format(10 * timeit('f_pp(data)', **kwds)))

Sample output:

[]                                        # Delta ja, pp
[]                                        # Delta ol, pp
[  749   750   751 ... 98786 98787 98788] # Delta mu, pp
itertools.groupby:  5.415 ms
re:                28.197 ms
pandas:            14.972 ms
numpy pp:           0.788 ms

Note: only from-scratch solutions were considered. @Olivier's, @juanpa.arrivillaga's and my approaches yield the same answer; @MaxU's doesn't. I couldn't get @Kasravnd's to finish reliably (which may be my fault: I don't know pandas and didn't fully understand @Kasravnd's solution).

Also note that this is only one example; other conditions (shorter lists, more switches, etc.) may change the ranking.
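
For intuition on f_pp, here is the padded-XOR step on the question's small x (a minimal sketch reusing the names from f_pp):

x = np.array([1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1])

# Pad the neighbour-XOR with the boundary values so that every run of 1's
# contributes one (start, stop) pair:
switches = np.empty((x.size + 1,), bool)
switches[0] = x[0]
switches[-1] = x[-1]
switches[1:-1] = x[:-1] ^ x[1:]
print(np.where(switches)[0].reshape(-1, 2))
# [[ 0  1]
#  [ 4  9]
#  [15 25]]  -> runs of 1's at [0, 1), [4, 9), [15, 25); only the last has length >= 10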

Upvotes: 4

Olivier Melançon

Reputation: 22324

With list comprehension

From your encoded list B, you can use a comprehension to generate the new list.

b = [(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)] # B

y = sum(([num and int(rep >= 10)] * rep for num, rep in b), [])

From the start with re

Alternatively, this looks like something re could do from the start, since it can work with bytes.

import re

x = [1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

y = list(re.sub(b'(?<!\x01)\x01{,9}(?!\x01)', lambda m: len(m.group()) * b'\x00', bytes(x)))

Both solutions output:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
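
A note on how the re version works (my reading of the pattern): bytes(x) encodes the list as raw bytes, so every run of 1's becomes a run of b'\x01' bytes. The lookarounds (?<!\x01) and (?!\x01) pin the match to a complete run, and \x01{,9} caps it at nine bytes, so only runs shorter than ten are rewritten to the same number of b'\x00' bytes.

print(bytes([1, 0, 0, 1, 1]))  # each 1 becomes b'\x01': b'\x01\x00\x00\x01\x01'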

Upvotes: 2

juanpa.arrivillaga

Reputation: 96267

Just check the run length while you are iterating over the groupby. Something like:

>>> from itertools import groupby
>>> x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
>>> result = []
>>> for k, g in groupby(x):
...     if k:
...         g = list(g)
...         if len(g) < 10:
...             g = len(g)*[0]
...     result.extend(g)
...
>>> result
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Note, this is faster than the corresponding pandas solution, for a dataset of this size at least:

In [11]: from itertools import groupby

In [12]: %%timeit
    ...: result = []
    ...: for k, g in groupby(x):
    ...:     if k:
    ...:         g = list(g)
    ...:         if len(g) < 10:
    ...:             g = len(g)*[0]
    ...:     result.extend(g)
    ...:
181 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %%timeit s = pd.Series(x)
    ...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
    ...:
4.03 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And note, that's being generous to the pandas solution by not counting the time to convert from list to pd.Series or back. Including those:

In [14]: %%timeit
    ...: s = pd.Series(x)
    ...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
    ...: s = s.tolist()
    ...:
4.92 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
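
If you need this more than once, the same logic wraps naturally into a function (my wrapper, with the run-length threshold as a parameter):

from itertools import groupby

def zero_short_runs(seq, min_len=10):
    # Replace runs of 1's shorter than min_len with 0's; runs of 0's pass through.
    result = []
    for k, g in groupby(seq):
        g = list(g)
        if k and len(g) < min_len:
            g = len(g) * [0]
        result.extend(g)
    return result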

Upvotes: 6

MaxU - stand with Ukraine

Reputation: 210962

Using pandas:

import pandas as pd

In [130]: s = pd.Series(x)

In [131]: s
Out[131]:
0     1
1     0
2     0
3     0
4     1
     ..
20    1
21    1
22    1
23    1
24    1
Length: 25, dtype: int64

In [132]: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0

In [133]: s.tolist()
Out[133]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [134]: s
Out[134]:
0     0
1     0
2     0
3     0
4     0
     ..
20    1
21    1
22    1
23    1
24    1
Length: 25, dtype: int64
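
To see why the mask in In [132] works, here are its pieces, evaluated on the original s (the names labels and run_len are mine):

labels = s.ne(1).cumsum()   # cumsum jumps at every 0, so each run of 1's
                            # shares a single label with the 0 just before it
run_len = s.groupby(labels).transform('count')  # group size, broadcast back
mask = run_len.lt(10)       # True for short runs of 1's; it is also True for
                            # most 0's, but setting a 0 to 0 is a no-op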

For your "huge" list it takes approx. 7 ms on my old notebook:

In [141]: len(x)
Out[141]: 5124

In [142]: %%timeit
     ...: s = pd.Series(x)
     ...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
     ...: res = s.tolist()
     ...:
6.56 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 1

Adi219

Reputation: 4814

Try this:

y = []
for num, rep in b:  # b is the encoded list you called #B
    add = 0
    if num == 1 and rep > 9:
        add = 1
    y.extend([add] * rep)
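
With b from the question, this yields the expected y:

print(y)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]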

Upvotes: 1
