Drimer

Reputation: 115

How to handle missing data in pandas dataframe?

I have a pandas dataframe containing the following information:

A dataframe could look like this:

df = 

     timestamp    t_idx  position  error    type    SNR
 0   16229767       5        2       1       T1     123
 1   16229767       5        1       0       T1     123
 3   16229767       5        3       0       T1     123
 4   16229767       5        4       0       T1     123
 5   16229767       3        3       1       T9      38
 6   16229767       3        1       0       T9      38
 7   16229767       3        4       0       T9      38
 8   29767162       7        1       0       T4     991
 9   29767162       7        4       1       T4     991 

If we look at the timestamp "16229767", there were 2 trays in use: Tray 3 and Tray 5. Each position for Tray 5 was detected. However, Tray 3 has missing data, as position 2 is missing.

I would like to fix that and add these rows programmatically:

 10  16229767       3        2       1       T9      38
 11  29767162       7        2       1       T4     991
 12  29767162       7        3       1       T4     991

I am not sure how to handle the missing values correctly. My naive approach right now is:

timestamps = df['timestamp'].unique()
for ts in timestamps:
    tray_ids = df.loc[df['timestamp'] == ts]['t_idx'].unique()
    for t_id in tray_ids:
        # For this timestamp and tray id, each position (1 to 4) should exist once:
        # df.loc[(df['timestamp'] == ts) & (df['t_idx'] == t_id)]
        # if not, append the missing position for the tray and set error to 1

How can I find the missing positions now and add the rows to my dataframe?

===

Edit: I simplified my example but omitted a relevant piece of information: there are also other columns, and the newly generated rows should carry the same content per tray. I made this clearer by adding two more columns.

Also, there was a question about the error column: each added row should simply get error = 1 (there is no logic behind it).

Upvotes: 3

Views: 1145

Answers (4)

sammywemmy

Reputation: 28729

pyjanitor has a complete function that exposes explicitly missing values (pyjanitor is a collection of convenient pandas functions).

In the challenge above, only the explicitly missing values within the data need to be exposed:

# pip install pyjanitor
import pandas as pd
import janitor
(df.complete(['timestamp', 't_idx', 'type', 'SNR'], 'position')
   .fillna({"error":1}, downcast='infer')
   .filter(df.columns)
)
 
    timestamp  t_idx  position  error type  SNR
0    16229767      5         2      1   T1  123
1    16229767      5         1      0   T1  123
2    16229767      5         3      0   T1  123
3    16229767      5         4      0   T1  123
4    16229767      3         2      1   T9   38
5    16229767      3         1      0   T9   38
6    16229767      3         3      1   T9   38
7    16229767      3         4      0   T9   38
8    29767162      7         2      1   T4  991
9    29767162      7         1      0   T4  991
10   29767162      7         3      1   T4  991
11   29767162      7         4      1   T4  991

In the code above, just the combination of ['timestamp', 't_idx', 'type', 'SNR'] and position is required to generate the missing rows, limiting the output to only the explicitly missing values within the dataframe. If all combinations of missing values were required, the brackets would be dropped, and you'd probably get a much larger dataframe.
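For reference, the grouped-versus-ungrouped distinction can be sketched in plain pandas (this is an illustrative equivalent, not pyjanitor's actual implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": [16229767] * 7 + [29767162] * 2,
    "t_idx":     [5, 5, 5, 5, 3, 3, 3, 7, 7],
    "position":  [2, 1, 3, 4, 3, 1, 4, 1, 4],
    "error":     [1, 0, 0, 0, 1, 0, 0, 0, 1],
    "type":      ["T1"] * 4 + ["T9"] * 3 + ["T4"] * 2,
    "SNR":       [123] * 4 + [38] * 3 + [991] * 2,
})

# Grouped: ['timestamp', 't_idx', 'type', 'SNR'] moves as one unit
# (one row per tray), crossed only with position -> 3 trays x 4 positions.
trays = df[["timestamp", "t_idx", "type", "SNR"]].drop_duplicates()
grouped = (trays.merge(pd.DataFrame({"position": [1, 2, 3, 4]}), how="cross")
                .merge(df, how="left")
                .fillna({"error": 1}))
print(len(grouped))  # 12 rows; only the three missing tray/position rows added

# Ungrouped would cross every column's unique values independently:
# 2 timestamps x 3 t_idx x 4 positions x 3 types x 3 SNRs = 216 rows here.
```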

Upvotes: 1

tlentali

Reputation: 3455

We can start by converting position to the categorical type, then use a groupby to generate the missing combinations and set the corresponding error values to 1.
We also have to fill the type and SNR columns with the correct values, like so:

>>> df['position'] = pd.Categorical(df['position'], categories=df['position'].unique())
>>> df_grouped = df.groupby(['timestamp', 't_idx', 'position'], as_index=False).first()
>>> df_grouped['error'] = df_grouped['error'].fillna(1)

>>> df_grouped.sort_values('type', inplace=True)
>>> df_grouped['type'] = df_grouped.groupby(['timestamp','t_idx'])['type'].ffill().bfill()

>>> df_grouped.sort_values('SNR', inplace=True)
>>> df_grouped['SNR'] = df_grouped.groupby(['timestamp','t_idx'])['SNR'].ffill().bfill()

>>> df_grouped = df_grouped.reset_index(drop=True)
    timestamp   t_idx   position    error   type    SNR
0   16229767    3       1           0.0     T9      38.0
1   16229767    3       3           1.0     T9      38.0
2   16229767    3       4           0.0     T9      38.0
3   16229767    5       2           1.0     T1      123.0
4   16229767    5       1           0.0     T1      123.0
5   16229767    5       3           0.0     T1      123.0
6   16229767    5       4           0.0     T1      123.0
7   29767162    7       1           0.0     T4      991.0
8   29767162    7       4           1.0     T4      991.0
9   16229767    3       2           1.0     T9      38.0
10  16229767    7       2           1.0     T4      991.0
11  16229767    7       1           1.0     T4      991.0
12  16229767    7       3           1.0     T4      991.0
13  16229767    7       4           1.0     T4      991.0
14  29767162    3       2           1.0     T4      991.0
15  29767162    3       1           1.0     T4      991.0
16  29767162    3       3           1.0     T4      991.0
17  29767162    3       4           1.0     T4      991.0
18  29767162    5       2           1.0     T4      991.0
19  29767162    5       1           1.0     T4      991.0
20  29767162    5       3           1.0     T4      991.0
21  29767162    5       4           1.0     T4      991.0
22  29767162    7       2           1.0     T4      991.0
23  29767162    7       3           1.0     T4      991.0

And then, we filter on the (timestamp, t_idx) pairs from the original DataFrame to get the expected result:

>>> df_grouped[
...     pd.Series(
...         list(zip(df_grouped['timestamp'].values, df_grouped['t_idx'].values))
...     ).isin(list(zip(df['timestamp'].values, df['t_idx'].values)))
... ].sort_values(by=['timestamp', 't_idx']).reset_index(drop=True)
    timestamp   t_idx   position    error   type    SNR
0   16229767    3       1           0.0     T9      38.0
1   16229767    3       3           1.0     T9      38.0
2   16229767    3       4           0.0     T9      38.0
3   16229767    3       2           1.0     T9      38.0
4   16229767    5       2           1.0     T1      123.0
5   16229767    5       1           0.0     T1      123.0
6   16229767    5       3           0.0     T1      123.0
7   16229767    5       4           0.0     T1      123.0
8   29767162    7       1           0.0     T4      991.0
9   29767162    7       4           1.0     T4      991.0
10  29767162    7       2           1.0     T4      991.0
11  29767162    7       3           1.0     T4      991.0
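The heavy lifting here comes from the Categorical dtype: with observed=False (the default), a groupby over a categorical key emits every category for every group, even categories that never occur in that group. A minimal sketch of that behaviour, using hypothetical toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "tray": [3, 3, 5],
    "position": pd.Categorical([1, 2, 1], categories=[1, 2]),
    "error": [0, 0, 0],
})

# groupby emits all tray x category combinations; (5, 2) never occurs
# in the data, so its aggregated error comes back as NaN, ready for fillna(1).
out = toy.groupby(["tray", "position"], as_index=False, observed=False).first()
print(out)
```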

Upvotes: 1

padu

Reputation: 899

You can try this code:

import pandas as pd

def foo(df):
    positions = set(range(1, 5))
    if df['position'].unique().size < 4:
        missing = positions.difference(df['position'].unique())
        add_df = df.iloc[:len(missing), :].copy()
        add_df.loc[:, 'position'] = list(missing)
        # I did not understand by what rule the values in the error
        # column are set; adjust them here as you need
        return pd.concat([df, add_df], ignore_index=True)
    return df

group = df.groupby(['timestamp', 't_idx'])
group.apply(foo).reset_index(drop=True)

    timestamp   t_idx   position    error
0   16229767     3        3           1
1   16229767     3        1           0
2   16229767     3        4           0
3   16229767     3        2           1
4   16229767     5        2           1
5   16229767     5        1           0
6   16229767     5        3           0
7   16229767     5        4           0
8   29767162     7        1           0
9   29767162     7        4           1
10  29767162     7        2           0
11  29767162     7        3           1

Upvotes: 0

Raymond Toh

Reputation: 799

You can build a helper dataframe with every tray (its timestamp and t_idx, plus the constant type and SNR) crossed with the fixed range of positions. Left-merging the original data onto it leaves NaN in the error column for each missing position, which you then fill with 1.

Sample code:

trays = df[['timestamp', 't_idx', 'type', 'SNR']].drop_duplicates()
df_tmp = trays.merge(pd.DataFrame({'position': range(1, 5)}), how='cross')
df = pd.merge(df_tmp, df, on=list(df_tmp.columns), how='left')
df['error'] = df['error'].fillna(1)

Upvotes: 0
