Reputation: 31

Keeping track of occurrence of unique IDs in time series

I would need your help to keep track of previous observations in a pandas data frame.

Example:

df = pd.DataFrame({'ID':[1,2,3,1,2,4,1,3,5]},
                  index = ['2015-01-01','2015-01-01','2015-01-01',
                           '2015-01-02','2015-01-02','2015-01-02',
                           '2015-01-03','2015-01-03','2015-01-03'])

I would like to return a simple output (pandas dataframe or list), containing the date and the observations on that day plus any other unique observation seen previously

Desired Output:

2015-01-01    [1,2,3]
2015-01-02    [1,2,3,4]
2015-01-03    [1,2,3,4,5]

I have the feeling this would be a very simple operation in pandas but I am lost. Could you please help and point me in the right direction?

Thanks

Upvotes: 1

Answers (2)

DSM

Reputation: 353059

This might not be the best idea if there are an enormous number of values, and I remind you that pandas objects aren't really designed for holding non-scalar objects, but you could use get_dummies:

>>> d = pd.get_dummies(df["ID"])
>>> m = d.cummax().groupby(level=0).last()
>>> m.apply(lambda x: m.columns[x.astype(bool)].tolist(), axis=1)
2015-01-01          [1, 2, 3]
2015-01-02       [1, 2, 3, 4]
2015-01-03    [1, 2, 3, 4, 5]
dtype: object

This works because dummies gives us a membership table:

>>> d
            1  2  3  4  5
2015-01-01  1  0  0  0  0
2015-01-01  0  1  0  0  0
2015-01-01  0  0  1  0  0
2015-01-02  1  0  0  0  0
2015-01-02  0  1  0  0  0
2015-01-02  0  0  0  1  0
2015-01-03  1  0  0  0  0
2015-01-03  0  0  1  0  0
2015-01-03  0  0  0  0  1

and we can then accumulate these:

>>> d.cummax().groupby(level=0).last()
            1  2  3  4  5
2015-01-01  1  1  1  0  0
2015-01-02  1  1  1  1  0
2015-01-03  1  1  1  1  1

Upvotes: 2

Alexander

Reputation: 109546

Getting the unique observations by date is easy:

daily_observations = df.groupby(level=0).ID.apply(lambda s: s.unique())

>>> daily_observations
2015-01-01    [1, 2, 3]
2015-01-02    [1, 2, 4]
2015-01-03    [1, 3, 5]
Name: ID, dtype: object

Accumulating them is more difficult:

>>> pd.Series({idx: list(set().union(*[v.tolist() 
                    for v in daily_observations.loc[daily_observations.index <= idx].values])) 
               for idx in daily_observations.index})

    2015-01-01          [1, 2, 3]
    2015-01-02       [1, 2, 3, 4]
    2015-01-03    [1, 2, 3, 4, 5]
    dtype: object

The part of the code pulls the cumulative list of lists for all occurrences.

>>> [daily_observations.loc[daily_observations.index <= idx].values for idx in daily_observations.index] 
[array([array([1, 2, 3])], dtype=object),
 array([array([1, 2, 3]), array([1, 2, 4])], dtype=object),
 array([array([1, 2, 3]), array([1, 2, 4]), array([1, 3, 5])], dtype=object)]

Which is then converted from arrays to a list of lists for each index value:

>>> [[v.tolist() for v in daily_observations.loc[daily_observations.index <= idx].values] for idx in daily_observations.index]
[[[1, 2, 3]], [[1, 2, 3], [1, 2, 4]], [[1, 2, 3], [1, 2, 4], [1, 3, 5]]]

These lists of lists are then converted to sets via a union and then converted back to lists.

>>> [list(set().union(*[v.tolist() 
                        for v in daily_observations.loc[daily_observations.index <= idx].values])) 
          for idx in daily_observations.index]
[[1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5]]

All the above is then wrapped in a dictionary comprehension which is then used to construct the Series.

Upvotes: 1

Keeping track of occurrence of unique IDs in time series

Answers (2)

Related Questions