Reputation: 115
I have a pandas dataframe containing the following information:
A dataframe could look like this:
df =
timestamp t_idx position error type SNR
0 16229767 5 2 1 T1 123
1 16229767 5 1 0 T1 123
3 16229767 5 3 0 T1 123
4 16229767 5 4 0 T1 123
5 16229767 3 3 1 T9 38
6 16229767 3 1 0 T9 38
7 16229767 3 4 0 T9 38
8 29767162 7 1 0 T4 991
9 29767162 7 4 1 T4 991
If we look at the timestamp "16229767", there were two trays in use: Tray 3 and Tray 5. Every position on Tray 5 was detected. However, Tray 3 has missing data, as position 2 is missing.
I would like to fix that and add these rows programmatically:
10 16229767 3 2 1 T9 38
11 29767162 7 2 1 T4 991
12 29767162 7 3 1 T4 991
I am not sure how to handle the missing values correctly. My naive approach right now is:
timestamps = df['timestamp'].unique()
for ts in timestamps:
    tray_ids = df.loc[df['timestamp'] == ts]['t_idx'].unique()
    for t_id in tray_ids:
        # For this timestamp and tray id, each position (1 to 4) should exist exactly once!
        # df.loc[(df['timestamp'] == ts) & (df['t_idx'] == t_id)]
        # if a position is missing, append it to the tray and set error to 1
How can I find the missing positions now and add the rows to my dataframe?
===
Edit: I was simplifying my example, but missed a relevant piece of information: there are also other columns, and the newly generated rows should have the same content per tray. I made it clearer by adding two more columns.
Also, there was a question about the error: for each row that had to be added, the error should automatically be 1 (there is no further logic behind it).
Upvotes: 3
Views: 1145
Reputation: 28729
pyjanitor has a complete function that exposes explicitly missing values (pyjanitor is a collection of convenient Pandas functions).
For the challenge above, only the explicitly missing values within the data need to be exposed:
# pip install pyjanitor
import pandas as pd
import janitor
(df.complete(['timestamp', 't_idx', 'type', 'SNR'], 'position')
.fillna({"error":1}, downcast='infer')
.filter(df.columns)
)
timestamp t_idx position error type SNR
0 16229767 5 2 1 T1 123
1 16229767 5 1 0 T1 123
2 16229767 5 3 0 T1 123
3 16229767 5 4 0 T1 123
4 16229767 3 2 1 T9 38
5 16229767 3 1 0 T9 38
6 16229767 3 3 1 T9 38
7 16229767 3 4 0 T9 38
8 29767162 7 2 1 T4 991
9 29767162 7 1 0 T4 991
10 29767162 7 3 1 T4 991
11 29767162 7 4 1 T4 991
In the code above, just the combination of ['timestamp', 't_idx', 'type', 'SNR'] and position is required to generate the missing values, limiting the output to only the explicitly missing values within the dataframe. If all combinations of missing values were required, the brackets would be dropped, and you would probably get a much larger dataframe.
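For readers without pyjanitor, the grouped completion can be approximated in plain pandas: cross the unique group keys with the set of positions and left-join the original frame. A minimal sketch of that idea, reusing the question's sample data (the cross merge requires pandas >= 1.2):

```python
import pandas as pd

# The question's sample data.
df = pd.DataFrame({
    "timestamp": [16229767] * 7 + [29767162] * 2,
    "t_idx":     [5, 5, 5, 5, 3, 3, 3, 7, 7],
    "position":  [2, 1, 3, 4, 3, 1, 4, 1, 4],
    "error":     [1, 0, 0, 0, 1, 0, 0, 0, 1],
    "type":      ["T1"] * 4 + ["T9"] * 3 + ["T4"] * 2,
    "SNR":       [123] * 4 + [38] * 3 + [991] * 2,
})

# One row per observed tray, crossed with every observed position.
groups = df[["timestamp", "t_idx", "type", "SNR"]].drop_duplicates()
positions = pd.DataFrame({"position": sorted(df["position"].unique())})
full = groups.merge(positions, how="cross")

# Left-join the original data; combinations that never occurred get NaN,
# which we fill with error = 1 as the question requires.
out = full.merge(df, on=["timestamp", "t_idx", "type", "SNR", "position"], how="left")
out["error"] = out["error"].fillna(1).astype(int)
```

This yields the same 12 rows as the pyjanitor call, with the previously missing (16229767, tray 3, position 2) row carrying error 1.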
Upvotes: 1
Reputation: 3455
We can start by converting position to the categorical type, then use a groupby to fill in all the missing values and set the corresponding error values to 1.
We also have to fill the type and SNR columns with the correct values, like so:
>>> df['position'] = pd.Categorical(df['position'], categories=df['position'].unique())
>>> df_grouped = df.groupby(['timestamp', 't_idx', 'position'], as_index=False).first()
>>> df_grouped['error'] = df_grouped['error'].fillna(1)
>>> df_grouped.sort_values('type', inplace=True)
>>> df_grouped['type'] = df_grouped.groupby(['timestamp','t_idx'])['type'].ffill().bfill()
>>> df_grouped.sort_values('SNR', inplace=True)
>>> df_grouped['SNR'] = df_grouped.groupby(['timestamp','t_idx'])['SNR'].ffill().bfill()
>>> df_grouped = df_grouped.reset_index(drop=True)
timestamp t_idx position error type SNR
0 16229767 3 1 0.0 T9 38.0
1 16229767 3 3 1.0 T9 38.0
2 16229767 3 4 0.0 T9 38.0
3 16229767 5 2 1.0 T1 123.0
4 16229767 5 1 0.0 T1 123.0
5 16229767 5 3 0.0 T1 123.0
6 16229767 5 4 0.0 T1 123.0
7 29767162 7 1 0.0 T4 991.0
8 29767162 7 4 1.0 T4 991.0
9 16229767 3 2 1.0 T9 38.0
10 16229767 7 2 1.0 T4 991.0
11 16229767 7 1 1.0 T4 991.0
12 16229767 7 3 1.0 T4 991.0
13 16229767 7 4 1.0 T4 991.0
14 29767162 3 2 1.0 T4 991.0
15 29767162 3 1 1.0 T4 991.0
16 29767162 3 3 1.0 T4 991.0
17 29767162 3 4 1.0 T4 991.0
18 29767162 5 2 1.0 T4 991.0
19 29767162 5 1 1.0 T4 991.0
20 29767162 5 3 1.0 T4 991.0
21 29767162 5 4 1.0 T4 991.0
22 29767162 7 2 1.0 T4 991.0
23 29767162 7 3 1.0 T4 991.0
And then we filter on the values from the original DataFrame to get the expected result:
>>> df_grouped[
... pd.Series(
... list(zip(df_grouped['timestamp'].values, df_grouped['t_idx'].values))
... ).isin(list(zip(df['timestamp'].values, df['t_idx'].values)))
... ].sort_values(by=['timestamp', 't_idx']).reset_index(drop=True)
timestamp t_idx position error type SNR
0 16229767 3 1 0.0 T9 38.0
1 16229767 3 3 1.0 T9 38.0
2 16229767 3 4 0.0 T9 38.0
3 16229767 3 2 1.0 T9 38.0
4 16229767 5 2 1.0 T1 123.0
5 16229767 5 1 0.0 T1 123.0
6 16229767 5 3 0.0 T1 123.0
7 16229767 5 4 0.0 T1 123.0
8 29767162 7 1 0.0 T4 991.0
9 29767162 7 4 1.0 T4 991.0
10 29767162 7 2 1.0 T4 991.0
11 29767162 7 3 1.0 T4 991.0
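The tuple/isin filter above can also be written as an inner merge on the observed (timestamp, t_idx) pairs, which avoids building intermediate lists of tuples. A minimal sketch, with small toy stand-ins for df and df_grouped:

```python
import pandas as pd

# Toy stand-ins: `df` holds the originally observed rows, `df_grouped`
# the full cartesian product produced by the categorical groupby.
df = pd.DataFrame({"timestamp": [1, 1, 2], "t_idx": [3, 5, 7]})
df_grouped = pd.DataFrame({
    "timestamp": [1, 1, 1, 2, 2, 2],
    "t_idx":     [3, 5, 7, 3, 5, 7],
    "position":  [1, 1, 1, 1, 1, 1],
})

# Keep only combinations whose (timestamp, t_idx) pair actually occurred;
# an inner merge drops everything else.
observed = df[["timestamp", "t_idx"]].drop_duplicates()
result = df_grouped.merge(observed, on=["timestamp", "t_idx"])
```

Only the three pairs present in df survive the merge; the other three rows of the cartesian product are dropped.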
Upvotes: 1
Reputation: 899
You can try this code:
def foo(df):
    set_ = set(range(1, 5))
    if df.position.unique().size < 4:
        diff_ = set_.difference(df.position.unique())
        add_df = df.iloc[:len(diff_), :].copy()
        add_df.loc[:, 'position'] = list(diff_)
        # Per the question's edit, every appended row gets error = 1
        add_df.loc[:, 'error'] = 1
        result_df = pd.concat([df, add_df], ignore_index=True)
        return result_df
    else:
        return df

group = df.groupby(['timestamp', 't_idx'])
group.apply(foo)
timestamp t_idx position error
0 16229767 3 3 1
1 16229767 3 1 0
2 16229767 3 4 0
3 16229767 3 2 1
4 16229767 5 2 1
5 16229767 5 1 0
6 16229767 5 3 0
7 16229767 5 4 0
8 29767162 7 1 0
9 29767162 7 4 1
10 29767162 7 2 1
11 29767162 7 3 1
Upvotes: 0
Reputation: 799
You can create a helper dataframe holding each (timestamp, t_idx) pair with the fixed range of positions. When you merge it with the original, you end up with NaN values in the error column for every missing position, which you then fill with 1.
Sample code:
pairs = df[['timestamp', 't_idx']].drop_duplicates()
positions = pd.DataFrame({'position': range(1, 5)})
df_tmp = pairs.merge(positions, how='cross')  # requires pandas >= 1.2
df = pd.merge(df_tmp, df, on=['timestamp', 't_idx', 'position'], how='left')
df['error'] = df['error'].fillna(1)
Upvotes: 0