Pandas Label Sequences of 1 in a 0-1 valued Series

Question

I'm working with a pd.Series which looks like this:

2013-01-02        NaN
2013-01-03        NaN
2013-01-04        NaN
2013-01-07   1.000000
2013-01-08   1.000000
2013-01-09   1.000000
2013-01-10   1.000000
2013-01-11   1.000000
2013-01-14   1.000000
2013-01-15   1.000000
2013-01-16   1.000000
2013-01-24   1.000000
2013-01-25   1.000000
2013-01-31   1.000000
2013-02-01          0
2013-02-04          0
2013-02-05          0
2013-02-11  -1.000000
2013-02-12  -1.000000
2013-02-13  -1.000000
2013-02-14          0
2013-02-15          0
2013-02-18          0

What I would like to do is get a Series like this:

2013-01-02        NaN
2013-01-03        NaN
2013-01-04        NaN
2013-01-07   1.000000
2013-01-08   1.000000
2013-01-09   1.000000
2013-01-10   1.000000
2013-01-11   1.000000
2013-01-14   1.000000
2013-01-15   1.000000
2013-01-16   1.000000
2013-01-24   1.000000
2013-01-25   1.000000
2013-01-31   1.000000
2013-02-01          0
2013-02-04          0
2013-02-05          0
2013-02-11   2.000000
2013-02-12   2.000000
2013-02-13   2.000000
2013-02-14          0
2013-02-15          0
2013-02-18          0

that is I would like to number the sequences of non-zeros and non-NaN values. I can't figure out a vectorized way to do it.

DSM · Accepted Answer

This is a little tricky, but uses patterns which come up frequently when dealing with contiguous-cluster groups. (Really we need to improve support for contiguous groupbys, but it would require changing the underlying data structures, which is why no one's gotten around to it yet.)

One way:

>>> cl = (ser.notnull() & (ser != 0))
>>> labels = ((cl != cl.shift()) & cl).cumsum() * cl + (ser * 0)
>>> labels
2013-01-02   NaN
2013-01-03   NaN
2013-01-04   NaN
2013-01-07     1
2013-01-08     1
2013-01-09     1
2013-01-10     1
2013-01-11     1
2013-01-14     1
2013-01-15     1
2013-01-16     1
2013-01-24     1
2013-01-25     1
2013-01-31     1
2013-02-01     0
2013-02-04     0
2013-02-05     0
2013-02-11     2
2013-02-12     2
2013-02-13     2
2013-02-14     0
2013-02-15     0
2013-02-18     0
dtype: float64

Some explanation follows. (To keep this short, I'm suppressing a lot of the repetition.)

First, we want to select the values which we're going to be labelling:

>>> cl = (ser.notnull() & (ser != 0))
>>> cl
2013-01-02    False
2013-01-03    False
2013-01-04    False
2013-01-07     True
...
2013-01-31     True
2013-02-01    False
2013-02-04    False
2013-02-05    False
2013-02-11     True
2013-02-12     True
2013-02-13     True
2013-02-14    False
2013-02-15    False
2013-02-18    False
dtype: bool

Now we find the start of each cluster by comparing it to a shifted version of itself:

>>> cl != cl.shift()
2013-01-02     True
2013-01-03    False
2013-01-04    False
2013-01-07     True
2013-01-08    False
2013-01-09    False
...
2013-01-31    False
2013-02-01     True
2013-02-04    False
2013-02-05    False
2013-02-11     True
2013-02-12    False
2013-02-13    False
2013-02-14     True
2013-02-15    False
2013-02-18    False
dtype: bool

But we only want the starts where a cluster we want to number begins:

>>> (cl != cl.shift()) & cl
2013-01-02    False
2013-01-03    False
2013-01-04    False
2013-01-07     True
2013-01-08    False
...
2013-02-05    False
2013-02-11     True
2013-02-12    False
2013-02-13    False
2013-02-14    False
2013-02-15    False
2013-02-18    False
dtype: bool

When we take the cumulative sum of these, since True==1 and False==0, we get a new number for each group:

>>> ((cl != cl.shift()) & cl).cumsum()
2013-01-02    0
2013-01-03    0
2013-01-04    0
2013-01-07    1
2013-01-08    1
...
2013-02-05    1
2013-02-11    2
2013-02-12    2
2013-02-13    2
2013-02-14    2
2013-02-15    2
2013-02-18    2
dtype: int64

But we don't want to number the groups where they're not part of a cluster:

>>> ((cl != cl.shift()) & cl).cumsum() * cl
2013-01-02    0
2013-01-03    0
2013-01-04    0
2013-01-07    1
2013-01-08    1
...
2013-01-31    1
2013-02-01    0
2013-02-04    0
2013-02-05    0
2013-02-11    2
2013-02-12    2
2013-02-13    2
2013-02-14    0
2013-02-15    0
2013-02-18    0
dtype: int64

And finally, we want the original NaNs to be preserved:

>>> ((cl != cl.shift()) & cl).cumsum() * cl + (ser * 0)
2013-01-02   NaN
2013-01-03   NaN
2013-01-04   NaN
2013-01-07     1
2013-01-08     1
...
2013-01-31     1
2013-02-01     0
2013-02-04     0
2013-02-05     0
2013-02-11     2
2013-02-12     2
2013-02-13     2
2013-02-14     0
2013-02-15     0
2013-02-18     0
dtype: float64

Pandas Label Sequences of 1 in a 0-1 valued Series

Answers (1)

Related Questions