Reputation: 1319
I have a grouped by Pandas dataframe:
id date temperature
1 2011-9-12 12
2011-9-12 20
2011-9-18 12
2011-9-19 90
2 2011-9-12 15
3 2011-9-12 15
2011-9-16 15
Here, each id has different numbers of records of temperature.
I want to fix them, to say the average number of records per id (say 3). If some records are missing I want to put zeros initially.
I want to keep the more recent records.
i.e. My final dataframe should be:
id temperature
1 20
12
90
2 0
0
15
3 0
15
15
Here is the numpy code which gives error on line:
s=df.groupby(level=0)['temperature'].apply(list)
s1=s.tolist()
arr = np.zeros((len(s1),3),int)
lens = [3-len(l) for l in s1]
mask = np.arange(3) >=np.array(lens)[:,None]
arr[mask] = np.concatenate(s1) ## Error
pd.DataFrame({'id':s.index.repeat(3),'temperature':arr.ravel()})
I suspect the error is due to the fact that my data can have more than 3 rows for an id.
How to fix the issue?
Upvotes: 2
Views: 200
Reputation: 828
A little lengthy solution but works:
df.groupby('id').apply(lambda x: x.sort_values(by='date'))
.drop('id', axis=1)['temperature'].groupby(level=0).tail(3)
.groupby(level=0).apply(lambda x: np.pad(x, (3-len(x),0), 'constant'))
.reset_index()
id temperature
0 1 [20, 12, 90]
1 2 [0, 0, 15]
2 3 [0, 15, 15]
Upvotes: 2
Reputation: 862791
Use GroupBy.cumcount
with ascending=False
for counter and Series.reindex
by MultiIndex
created by MultiIndex.from_product
:
print (df)
id date temperature
0 1 2011-9-12 12
1 1 2011-9-12 20
2 1 2011-9-18 12
3 1 2011-9-19 90
4 2 2011-9-12 15
5 3 2011-9-12 15
6 3 2011-9-16 15
N = 3
df['new'] = df.groupby('id').cumcount(ascending=False)
mux = pd.MultiIndex.from_product([df['id'].unique(), range(N-1, -1, -1)], names=['id','new'])
df1 = (df.set_index(['id', 'new'])['temperature']
.reindex(mux, fill_value=0)
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
id temperature
0 1 20
1 1 12
2 1 90
3 2 0
4 2 0
5 2 15
6 3 0
7 3 15
8 3 15
EDIT:
If MultiIndex DataFrame
:
print (df)
temperature
id date
1 2011-9-12 12
2011-9-12 20
2011-9-18 12
2011-9-19 90
2 2011-9-12 15
3 2011-9-12 15
2011-9-16 15
print (df.index)
MultiIndex(levels=[[1, 2, 3], ['2011-9-12', '2011-9-16', '2011-9-18', '2011-9-19']],
codes=[[0, 0, 0, 0, 1, 2, 2], [0, 0, 2, 3, 0, 0, 1]],
names=['id', 'date'])
N = 3
df['new'] = df.groupby('id').cumcount(ascending=False)
mux = pd.MultiIndex.from_product([df.index.levels[0], range(N-1, -1, -1)], names=['id','new'])
df1 = (df.reset_index(level=1, drop=True)
.set_index('new', append=True)['temperature']
.reindex(mux, fill_value=0)
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
id temperature
0 1 20
1 1 12
2 1 90
3 2 0
4 2 0
5 2 15
6 3 0
7 3 15
8 3 15
Upvotes: 3