Reputation: 31
I would need your help to keep track of previous observations in a pandas data frame.
Example:
df = pd.DataFrame({'ID':[1,2,3,1,2,4,1,3,5]},
index = ['2015-01-01','2015-01-01','2015-01-01',
'2015-01-02','2015-01-02','2015-01-02',
'2015-01-03','2015-01-03','2015-01-03'])
I would like to return a simple output (pandas dataframe or list), containing the date and the observations on that day plus any other unique observation seen previously
Desired Output:
2015-01-01 [1,2,3]
2015-01-02 [1,2,3,4]
2015-01-03 [1,2,3,4,5]
I have the feeling this would be a very simple operation in pandas but I am lost. Could you please help and point me in the right direction?
Thanks
Upvotes: 1
Views: 347
Reputation: 353059
This might not be the best idea if there are an enormous number of values, and I remind you that pandas objects aren't really designed for holding non-scalar objects, but you could use get_dummies
:
>>> d = pd.get_dummies(df["ID"])
>>> m = d.cummax().groupby(level=0).last()
>>> m.apply(lambda x: m.columns[x.astype(bool)].tolist(), axis=1)
2015-01-01 [1, 2, 3]
2015-01-02 [1, 2, 3, 4]
2015-01-03 [1, 2, 3, 4, 5]
dtype: object
This works because dummies gives us a membership table:
>>> d
1 2 3 4 5
2015-01-01 1 0 0 0 0
2015-01-01 0 1 0 0 0
2015-01-01 0 0 1 0 0
2015-01-02 1 0 0 0 0
2015-01-02 0 1 0 0 0
2015-01-02 0 0 0 1 0
2015-01-03 1 0 0 0 0
2015-01-03 0 0 1 0 0
2015-01-03 0 0 0 0 1
and we can then accumulate these:
>>> d.cummax().groupby(level=0).last()
1 2 3 4 5
2015-01-01 1 1 1 0 0
2015-01-02 1 1 1 1 0
2015-01-03 1 1 1 1 1
Upvotes: 2
Reputation: 109546
Getting the unique observations by date is easy:
daily_observations = df.groupby(level=0).ID.apply(lambda s: s.unique())
>>> daily_observations
2015-01-01 [1, 2, 3]
2015-01-02 [1, 2, 4]
2015-01-03 [1, 3, 5]
Name: ID, dtype: object
Accumulating them is more difficult:
>>> pd.Series({idx: list(set().union(*[v.tolist()
for v in daily_observations.loc[daily_observations.index <= idx].values]))
for idx in daily_observations.index})
2015-01-01 [1, 2, 3]
2015-01-02 [1, 2, 3, 4]
2015-01-03 [1, 2, 3, 4, 5]
dtype: object
The part of the code pulls the cumulative list of lists for all occurrences.
>>> [daily_observations.loc[daily_observations.index <= idx].values for idx in daily_observations.index]
[array([array([1, 2, 3])], dtype=object),
array([array([1, 2, 3]), array([1, 2, 4])], dtype=object),
array([array([1, 2, 3]), array([1, 2, 4]), array([1, 3, 5])], dtype=object)]
Which is then converted from arrays to a list of lists for each index value:
>>> [[v.tolist() for v in daily_observations.loc[daily_observations.index <= idx].values] for idx in daily_observations.index]
[[[1, 2, 3]], [[1, 2, 3], [1, 2, 4]], [[1, 2, 3], [1, 2, 4], [1, 3, 5]]]
These lists of lists are then converted to sets via a union and then converted back to lists.
>>> [list(set().union(*[v.tolist()
for v in daily_observations.loc[daily_observations.index <= idx].values]))
for idx in daily_observations.index]
[[1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5]]
All the above is then wrapped in a dictionary comprehension which is then used to construct the Series.
Upvotes: 1