Reputation: 1

Expanding/Modifying a data frame by adding data points based on difference between two columns

I have a data frame, which is the annotation of animal behavior recording frame by frame. But due to the limitation of the annotation software, the exported data frame is like this:

             Frame stop       Behavior
Frame start                           
0                     2        turning
2                     5     immobility
5                     9  mild movement
9                    11     immobility
16                   20           jump

Code:

data = {
    'Frame start': [0, 2, 5, 9, 16],
    'Frame stop': [2, 5, 9, 11, 20],
    'Behavior': ['turning', 'immobility', 'mild movement', 'immobility', 'jump']}

df = pd.DataFrame(data)
df.set_index('Frame start', inplace=True)

I want to modify the data frame to be like this:

Frame start
0           turning
1           turning
2        immobility
3        immobility
4        immobility
5     mild movement
6     mild movement
7     mild movement
8     mild movement
9        immobility
10       immobility
11              NaN
12              NaN
13              NaN
14              NaN
15              NaN
16             jump
17             jump
18             jump
19             jump

How can I do that?

Since I did not annotate every frame of a video, there are gaps of frames with no values, i.e. between index 4 and 5, frame 11-15 in this example. Therefore, I cannot re-index the data frame using the "frame start" column and simply <ffill> to fill up the missing values, which will look like the following:

new_index = [*range(0, 20)]
new_df = df.reindex(new_index, method='ffill')
final_df = new_df.loc[:, 'Behavior']

print(final_df)

Frame start
0           turning
1           turning
2        immobility
3        immobility
4        immobility
5     mild movement
6     mild movement
7     mild movement
8     mild movement
9        immobility
10       immobility
11       immobility
12       immobility
13       immobility
14       immobility
15       immobility
16             jump
17             jump
18             jump
19             jump

Upvotes: 0

Answers (3)

Panda Kim

Reputation: 13257

Use IntervalIndex

idx = pd.IntervalIndex.from_arrays(df.index, df['Frame stop'], closed='left')
val = range(df['Frame stop'].max()) # or set manually val = range(20)
out = pd.Series(val).map(df['Behavior'].set_axis(idx))

out

0           turning
1           turning
2        immobility
3        immobility
4        immobility
5     mild movement
6     mild movement
7     mild movement
8     mild movement
9        immobility
10       immobility
11              NaN
12              NaN
13              NaN
14              NaN
15              NaN
16             jump
17             jump
18             jump
19             jump

Upvotes: 1

PaulS

Reputation: 25393

Another possible solution, which uses:

np.arange to generate an array frames that ranges from the first index of df to the last value in the Frame stop column.
np.searchsorted to find the indices idx in df.index where the values in frames should be inserted to maintain order, adjusted by subtracting 1.
np.where to check if each value in frames is less than the corresponding value in the Frame stop column of df, indexed by idx. If true, it selects the corresponding value from the Behavior column; otherwise, it assigns NaN.
pd.Series to create a new series with the resulting values, using frames as the index and naming the series Behavior.

pd.Series(
    np.where(
        (frames := np.arange(df.index[0], df['Frame stop'].iloc[-1])) 
        < df['Frame stop'].iloc[
            (idx := np.searchsorted(df.index, frames, side='right') - 1)],
        df['Behavior'].iloc[idx],
        np.nan),
    index=frames,
    name='Behavior')

Output:

0           turning
1           turning
2        immobility
3        immobility
4        immobility
5     mild movement
6     mild movement
7     mild movement
8     mild movement
9        immobility
10       immobility
11              NaN
12              NaN
13              NaN
14              NaN
15              NaN
16             jump
17             jump
18             jump
19             jump
Name: Behavior, dtype: object

Upvotes: 0

Jay Livingston

Reputation: 101

Because you are forward-filling the Null values in Behavior and Frame stop after reindexing on a continuous Frame start, you only need to Null-out the Behavior values when the Frame stop is less than or equal to the Frame start index. You could do that with numpy.where() or pandas df.loc to overwrite Behavior:

import numpy as np

# you could use numpy.where for overwriting Behavior, or .loc
#new_df['Behavior'] = np.where(new_df['Frame stop'] <= new_df.index, np.nan, new_df['Behavior'])
new_df.loc[new_df['Frame stop'] <= new_df.index, 'Behavior'] = np.nan

print(new_df)

             Frame stop       Behavior
Frame start                           
0                     2        turning
1                     2        turning
2                     5     immobility
3                     5     immobility
4                     5     immobility
5                     9  mild movement
6                     9  mild movement
7                     9  mild movement
8                     9  mild movement
9                    11     immobility
10                   11     immobility
11                   11            NaN
12                   11            NaN
13                   11            NaN
14                   11            NaN
15                   11            NaN
16                   20           jump
17                   20           jump
18                   20           jump
19                   20           jump

Upvotes: 1

Expanding/Modifying a data frame by adding data points based on difference between two columns

Answers (3)

Related Questions