Reputation: 6290
I have the following Pandas DataFrame:
start_timestamp_milli end_timestamp_milli name rating
1 1555414708025 1555414723279 Valence 2
2 1555414708025 1555414723279 Arousal 6
3 1555414708025 1555414723279 Dominance 2
4 1555414708025 1555414723279 Sadness 1
5 1555414813304 1555414831795 Valence 3
6 1555414813304 1555414831795 Arousal 5
7 1555414813304 1555414831795 Dominance 2
8 1555414813304 1555414831795 Sadness 1
9 1555414921819 1555414931382 Valence 1
10 1555414921819 1555414931382 Arousal 7
11 1555414921819 1555414931382 Dominance 2
12 1555414921819 1555414931382 Sadness 1
13 1555414921819 1555414931382 Anger 1
In the above example there are three groups which can be grouped by start_timestamp_milli and end_timestamp_milli. The first group is from index 1 to 4, the second group from index 5 - 8 and the third group from index 9 to 13.
For each such group, if in the name column "Anger" and Happiness" is not present, I would like to insert it with a rating of 0. If it is present, nothing should happen.
The final result should look like this. The added lines are line 5, 6, 11, 12 and 18.
start_timestamp_milli end_timestamp_milli name rating
1 1555414708025 1555414723279 Valence 2
2 1555414708025 1555414723279 Arousal 6
3 1555414708025 1555414723279 Dominance 2
4 1555414708025 1555414723279 Sadness 1
5 1555414708025 1555414723279 Happiness 0
6 1555414708025 1555414723279 Anger 0
7 1555414813304 1555414831795 Valence 3
8 1555414813304 1555414831795 Arousal 5
9 1555414813304 1555414831795 Dominance 2
10 1555414813304 1555414831795 Sadness 1
11 1555414813304 1555414831795 Happiness 0
12 1555414813304 1555414831795 Anger 0
13 1555414921819 1555414931382 Valence 1
14 1555414921819 1555414931382 Arousal 7
15 1555414921819 1555414931382 Dominance 2
16 1555414921819 1555414931382 Sadness 1
17 1555414921819 1555414931382 Happiness 0
18 1555414921819 1555414931382 Anger 1
How can this be done?
Upvotes: 1
Views: 147
Reputation: 294278
This very explicitly loops through each group and appends dummy dataframe and drops duplicates.
d = dict(name=['Anger', 'Happiness'], rating=0)
cols = ['start_timestamp_milli', 'end_timestamp_milli']
def f(d0, k):
d1 = pd.DataFrame({**dict(zip(cols, k)), **d})
return d0.append(d1, ignore_index=True).drop_duplicates('name')
pd.concat([f(d, k) for k, d in df.groupby(cols)], ignore_index=True)
start_timestamp_milli end_timestamp_milli name rating
0 1555414708025 1555414723279 Valence 2
1 1555414708025 1555414723279 Arousal 6
2 1555414708025 1555414723279 Dominance 2
3 1555414708025 1555414723279 Sadness 1
4 1555414708025 1555414723279 Anger 0
5 1555414708025 1555414723279 Happiness 0
6 1555414813304 1555414831795 Valence 3
7 1555414813304 1555414831795 Arousal 5
8 1555414813304 1555414831795 Dominance 2
9 1555414813304 1555414831795 Sadness 1
10 1555414813304 1555414831795 Anger 0
11 1555414813304 1555414831795 Happiness 0
12 1555414921819 1555414931382 Valence 1
13 1555414921819 1555414931382 Arousal 7
14 1555414921819 1555414931382 Dominance 2
15 1555414921819 1555414931382 Sadness 1
16 1555414921819 1555414931382 Anger 1
17 1555414921819 1555414931382 Happiness 0
This builds a new index and uses reindex
cats = ['Anger', 'Happiness']
cols = ['start_timestamp_milli', 'end_timestamp_milli']
d = df.set_index([*cols, 'name'])
i = pd.MultiIndex.from_tuples(
[(s, e, n) for s, e in {*zip(*map(df.get, cols))} for n in cats],
names=d.index.names
) | d.index
df.set_index([*cols, 'name']).reindex(i, fill_value=0).reset_index()
start_timestamp_milli end_timestamp_milli name rating
0 1555414708025 1555414723279 Anger 0
1 1555414708025 1555414723279 Arousal 6
2 1555414708025 1555414723279 Dominance 2
3 1555414708025 1555414723279 Happiness 0
4 1555414708025 1555414723279 Sadness 1
5 1555414708025 1555414723279 Valence 2
6 1555414813304 1555414831795 Anger 0
7 1555414813304 1555414831795 Arousal 5
8 1555414813304 1555414831795 Dominance 2
9 1555414813304 1555414831795 Happiness 0
10 1555414813304 1555414831795 Sadness 1
11 1555414813304 1555414831795 Valence 3
12 1555414921819 1555414931382 Anger 1
13 1555414921819 1555414931382 Arousal 7
14 1555414921819 1555414931382 Dominance 2
15 1555414921819 1555414931382 Happiness 0
16 1555414921819 1555414931382 Sadness 1
17 1555414921819 1555414931382 Valence 1
Upvotes: 3
Reputation: 323266
I am using unstack
+ stack
+ reindex
s=set(df.name.unique().tolist()+['Anger','Happiness'])
df.set_index(df.columns[:-1].tolist()).rating.\
unstack(fill_value=0).\
reindex(columns=s,fill_value=0).\
stack().reset_index()
Upvotes: 3