Reputation: 909
I'm working with a pd.Series which looks like this:
2013-01-02 NaN
2013-01-03 NaN
2013-01-04 NaN
2013-01-07 1.000000
2013-01-08 1.000000
2013-01-09 1.000000
2013-01-10 1.000000
2013-01-11 1.000000
2013-01-14 1.000000
2013-01-15 1.000000
2013-01-16 1.000000
2013-01-24 1.000000
2013-01-25 1.000000
2013-01-31 1.000000
2013-02-01 0
2013-02-04 0
2013-02-05 0
2013-02-11 -1.000000
2013-02-12 -1.000000
2013-02-13 -1.000000
2013-02-14 0
2013-02-15 0
2013-02-18 0
What I would like to do is get a Series like this:
2013-01-02 NaN
2013-01-03 NaN
2013-01-04 NaN
2013-01-07 1.000000
2013-01-08 1.000000
2013-01-09 1.000000
2013-01-10 1.000000
2013-01-11 1.000000
2013-01-14 1.000000
2013-01-15 1.000000
2013-01-16 1.000000
2013-01-24 1.000000
2013-01-25 1.000000
2013-01-31 1.000000
2013-02-01 0
2013-02-04 0
2013-02-05 0
2013-02-11 2.000000
2013-02-12 2.000000
2013-02-13 2.000000
2013-02-14 0
2013-02-15 0
2013-02-18 0
that is I would like to number the sequences of non-zeros and non-NaN values. I can't figure out a vectorized way to do it.
Upvotes: 0
Views: 92
Reputation: 353359
This is a little tricky, but uses patterns which come up frequently when dealing with contiguous-cluster groups. (Really we need to improve support for contiguous groupbys, but it would require changing the underlying data structures, which is why no one's gotten around to it yet.)
One way:
>>> cl = (ser.notnull() & (ser != 0))
>>> labels = ((cl != cl.shift()) & cl).cumsum() * cl + (ser * 0)
>>> labels
2013-01-02 NaN
2013-01-03 NaN
2013-01-04 NaN
2013-01-07 1
2013-01-08 1
2013-01-09 1
2013-01-10 1
2013-01-11 1
2013-01-14 1
2013-01-15 1
2013-01-16 1
2013-01-24 1
2013-01-25 1
2013-01-31 1
2013-02-01 0
2013-02-04 0
2013-02-05 0
2013-02-11 2
2013-02-12 2
2013-02-13 2
2013-02-14 0
2013-02-15 0
2013-02-18 0
dtype: float64
Some explanation follows. (To keep this short, I'm suppressing a lot of the repetition.)
First, we want to select the values which we're going to be labelling:
>>> cl = (ser.notnull() & (ser != 0))
>>> cl
2013-01-02 False
2013-01-03 False
2013-01-04 False
2013-01-07 True
...
2013-01-31 True
2013-02-01 False
2013-02-04 False
2013-02-05 False
2013-02-11 True
2013-02-12 True
2013-02-13 True
2013-02-14 False
2013-02-15 False
2013-02-18 False
dtype: bool
Now we find the start of each cluster by comparing it to a shifted version of itself:
>>> cl != cl.shift()
2013-01-02 True
2013-01-03 False
2013-01-04 False
2013-01-07 True
2013-01-08 False
2013-01-09 False
...
2013-01-31 False
2013-02-01 True
2013-02-04 False
2013-02-05 False
2013-02-11 True
2013-02-12 False
2013-02-13 False
2013-02-14 True
2013-02-15 False
2013-02-18 False
dtype: bool
But we only want the starts where a cluster we want to number begins:
>>> (cl != cl.shift()) & cl
2013-01-02 False
2013-01-03 False
2013-01-04 False
2013-01-07 True
2013-01-08 False
...
2013-02-05 False
2013-02-11 True
2013-02-12 False
2013-02-13 False
2013-02-14 False
2013-02-15 False
2013-02-18 False
dtype: bool
When we take the cumulative sum of these, since True==1 and False==0, we get a new number for each group:
>>> ((cl != cl.shift()) & cl).cumsum()
2013-01-02 0
2013-01-03 0
2013-01-04 0
2013-01-07 1
2013-01-08 1
...
2013-02-05 1
2013-02-11 2
2013-02-12 2
2013-02-13 2
2013-02-14 2
2013-02-15 2
2013-02-18 2
dtype: int64
But we don't want to number the groups where they're not part of a cluster:
>>> ((cl != cl.shift()) & cl).cumsum() * cl
2013-01-02 0
2013-01-03 0
2013-01-04 0
2013-01-07 1
2013-01-08 1
...
2013-01-31 1
2013-02-01 0
2013-02-04 0
2013-02-05 0
2013-02-11 2
2013-02-12 2
2013-02-13 2
2013-02-14 0
2013-02-15 0
2013-02-18 0
dtype: int64
And finally, we want the original NaNs to be preserved:
>>> ((cl != cl.shift()) & cl).cumsum() * cl + (ser * 0)
2013-01-02 NaN
2013-01-03 NaN
2013-01-04 NaN
2013-01-07 1
2013-01-08 1
...
2013-01-31 1
2013-02-01 0
2013-02-04 0
2013-02-05 0
2013-02-11 2
2013-02-12 2
2013-02-13 2
2013-02-14 0
2013-02-15 0
2013-02-18 0
dtype: float64
Upvotes: 1