borisdonchev

Reputation: 1224

What is the most efficient way to fill missing values in this data frame?

I have the following pandas DataFrame:

import pandas as pd

df = pd.DataFrame([
    ['A', 2017, 1],
    ['A', 2019, 1],
    ['B', 2017, 1],
    ['B', 2018, 1],
    ['C', 2016, 1],
    ['C', 2019, 1],
], columns=['ID', 'year', 'number'])

and I am looking for the most efficient way to fill in the missing years with a default value of 0 in the number column.

The expected output is:

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

The dataframe that I have is relatively big, so I am looking for an efficient solution.

Edit:

This is the code that I have so far:

# per-ID min and max year
min_max_dict = df[['ID', 'year']].groupby('ID').agg([min, max]).to_dict('index')

# build the full (ID, year) index, including the missing years
new_ix = [[], []]
for id_ in df['ID'].unique():
    for year in range(min_max_dict[id_][('year', 'min')], min_max_dict[id_][('year', 'max')] + 1):
        new_ix[0].append(id_)
        new_ix[1].append(year)

# reindex against the full index, filling new rows with 0
df.set_index(['ID', 'year'], inplace=True)
df = df.reindex(new_ix, fill_value=0).reset_index()

Result

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

Upvotes: 24

Views: 1772

Answers (8)

sammywemmy

Reputation: 28729

We could use the complete function from pyjanitor, which provides a convenient abstraction to generate the missing rows, in this case per ID group:

# pip install pyjanitor
import pandas as pd
import janitor as jn

# create mapping for range of years
years = dict(year = lambda year: range(year.min(), year.max() + 1))

# apply the complete function per group and fill the nulls with 0
df.complete(years, by = 'ID', sort = True).fillna(0, downcast = 'infer')
 
  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

However, the by is primarily for convenience; there can be scenarios where it is more efficient to do a bit more work, similar to @ALollz's solution:

# get the mapping for the year for the entire dataframe
years = dict(year =  range(df.year.min(), df.year.max() + 1))

# create a groupby
group = df.groupby('ID').year

# assign the max and min years to the dataframe
(df.assign(year_max = group.transform('max'),
           year_min = group.transform('min'))
   # run complete on the entire dataframe, without `by`;
   # note that ID, year_min and year_max are grouped together -
   # think of them as a DataFrame of just these three columns
   # combined with years; we are not modifying these three columns,
   # only the years
   .complete(years, ('ID', 'year_min', 'year_max'))
   # filter rows where year is between min and max
   .loc[lambda df: df.year.between(df.year_min, df.year_max),
        df.columns]
   # sort the values and fill the nulls
   .sort_values([*df], ignore_index = True)
   .fillna(0, downcast = 'infer')
)
 
  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

Using @ALollz's sample data:

import numpy as np

N = 50000
df = pd.DataFrame({'ID': np.repeat(range(N), 2),
                   'year': np.tile([2010,2018], N),
                   'number': 1})
def complete_sam(df):
    years = dict(year =  range(df.year.min(), df.year.max() + 1))
    group = df.groupby('ID').year
    outcome = (df.assign(year_max = group.transform('max'),
                         year_min = group.transform('min'))
                 .complete(years, ('ID', 'year_min', 'year_max'))
                 .loc[lambda df: df.year.between(df.year_min, 
                                                 df.year_max),
                     df.columns]
                 .sort_values([*df], ignore_index = True)
                 .fillna(0)
              )
    return outcome
#@Scott Boston's Answer
def SB(df):
    idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
    df = df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()
    return df
#@Alollz's answer
def Alollz(df):
    idx = pd.MultiIndex.from_product([np.unique(df['ID']), 
                                      np.arange(df['year'].min(), df['year'].max()+1)],
                                     names=['ID', 'year'])
   
    df_b = pd.DataFrame({'number': 0}, index=idx)
    df_b.update(df.set_index(['ID', 'year']))
    
    m = (df_b.groupby(level=0)['number'].cummax().eq(1) 
         & df_b[::-1].groupby(level=0)['number'].cummax().eq(1))
    
    return df_b.loc[m].reset_index()

In [310]: Alollz(df).equals(complete_sam(df))
Out[310]: True

In [311]: %timeit complete_sam(df)
268 ms ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [312]: %timeit Alollz(df)
1.84 s ± 58.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [316]: SB(df).eq(complete_sam(df)).all().all()
Out[316]: True

In [317]: %timeit SB(df)
6.13 s ± 87.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 1

ALollz

Reputation: 59579

Here is a method that avoids any slow applies with a lambda. It is memory-inefficient in the sense that we create a base DataFrame that is the cross-product of all IDs and the full year range in your DataFrame. After an update we can efficiently slice it down to the periods you need with a Boolean mask. The mask is created from a cummax check in the forward and reverse directions.

If most IDs span the same general range of years, there won't be too much waste in creating the base DataFrame from a product. If you want even more performance, there are many posts about more efficient ways to do a cross-product (one possibility is sketched after the timings below).

import numpy as np
import pandas as pd

def Alollz(df):
    # base index: every ID crossed with the full year range
    idx = pd.MultiIndex.from_product([np.unique(df['ID']),
                                      np.arange(df['year'].min(), df['year'].max()+1)],
                                     names=['ID', 'year'])

    # base DataFrame of zeros, then overwrite with the known values
    df_b = pd.DataFrame({'number': 0}, index=idx)
    df_b.update(df.set_index(['ID', 'year']))

    # forward and reverse cummax: True only between the first and last
    # observed (number == 1) year of each ID
    m = (df_b.groupby(level=0)['number'].cummax().eq(1)
         & df_b[::-1].groupby(level=0)['number'].cummax().eq(1))

    return df_b.loc[m].reset_index()

Alollz(df)

  ID  year  number
0  A  2017     1.0
1  A  2018     0.0
2  A  2019     1.0
3  B  2017     1.0
4  B  2018     1.0
5  C  2016     1.0
6  C  2017     0.0
7  C  2018     0.0
8  C  2019     1.0

That's certainly a lot more code than some of the other proposals. But to see where it really shines, let's create some dummy data with 50K IDs (here I'll let the date ranges be identical for all IDs, just for simplicity of creating test data).

N = 50000
df = pd.DataFrame({'ID': np.repeat(range(N), 2),
                   'year': np.tile([2010,2018], N),
                   'number': 1})

#@Scott Boston's Answer
def SB(df):
    idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
    df = df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()
    return df

# Make sure they give the same output:
(Alollz(df) == SB(df)).all().all()
#True

%timeit Alollz(df)
#1.9 s ± 73.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit SB(df)
#10.8 s ± 539 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So that's about 5x faster, which is a pretty big deal when things are taking on the order of seconds.
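
As for the cross-product note above, one alternative (not from the original answer, just a rough sketch) is to build the index arrays directly with np.repeat and np.tile instead of MultiIndex.from_product; whether this is actually faster will depend on your data and pandas version:

import numpy as np
import pandas as pd

def cross_product_index(df):
    """Build the full (ID, year) index without MultiIndex.from_product."""
    ids = np.unique(df['ID'])
    years = np.arange(df['year'].min(), df['year'].max() + 1)
    # repeat each ID once per year and tile the year range once per ID
    return pd.MultiIndex.from_arrays(
        [np.repeat(ids, len(years)), np.tile(years, len(ids))],
        names=['ID', 'year'])

This drops into Alollz above as a replacement for the idx construction; the resulting index is identical.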

Upvotes: 5

Narendra Prasath

Reputation: 1531

You can try using date_range and pd.merge:

# per-ID min and max year (dict-based renaming on a SeriesGroupBy is no longer
# supported in recent pandas, so aggregate with a list instead)
g = df.groupby("ID")["year"].agg(["min", "max"]).reset_index()

# one row per ID holding the ID followed by every year in its range,
# built with date_range, then melted into long form
id_years = pd.DataFrame(list(g.apply(lambda row: [row["ID"]] +
                    list(pd.date_range(start=f"01/01/{row['min']}",
                    end=f"01/01/{row['max'] + 1}", freq='12M').year), axis=1))).melt(0).dropna()[[0, "value"]]

id_years.loc[:, "value"] = id_years["value"].astype(int)
id_years = id_years.rename(columns={0: "ID", "value": 'year'})
id_years = id_years.sort_values(["ID", "year"]).reset_index(drop=True)

## Merge two dataframe
output_df = pd.merge(id_years, df, on=["ID","year"], how="left").fillna(0)
output_df.loc[:,"number"] = output_df["number"].astype(int)
output_df

output:

    ID  year    number
0   A   2017    1
1   A   2018    0
2   A   2019    1
3   B   2017    1
4   B   2018    1
5   C   2016    1
6   C   2017    0
7   C   2018    0
8   C   2019    1

Upvotes: 2

Scott Boston

Reputation: 153550

A slightly faster approach than using explode is to use the pd.Series constructor. And you can use .iloc if the years are already sorted from earliest to latest (a variant for unsorted years is sketched after the output).

idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()

Output:

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1
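
If the years are not guaranteed to be sorted within each ID, a minimal variant of the same idea (not part of the original answer) swaps the positional .iloc lookups for min/max:

import numpy as np
import pandas as pd

# same reindex trick, but min()/max() instead of positional .iloc
idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.min(), x.max() + 1))).reset_index()
df.set_index(['ID', 'year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()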

Upvotes: 20

anky

Reputation: 75150

Here is another approach with reindex:

u = df.groupby('ID')['year'].apply(lambda x: range(x.min(),x.max()+1)).explode()

out = (df.set_index(['ID','year']).reindex(u.reset_index().to_numpy(),fill_value=0)
         .reset_index())

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

Upvotes: 11

Chris

Reputation: 16172

# per-ID min and max year
t = df.groupby('ID')['year'].agg(['min','max']).reset_index()
# list the years missing between min and max for each ID
# (assumes each ID only appears at its min and max year, as in the question's data)
t['missing'] = t.transform(lambda x: [y for y in range(x['min'], x['max']+1) if y not in x.values], axis=1)
# one row per missing (ID, year) pair, with number set to 0
t = t[['ID','missing']].explode('missing').dropna()
t['number'] = 0
t.columns = ['ID','year','number']
# append the filler rows to the original frame and sort
pd.concat([df,t]).sort_values(by=['ID','year'])

Output

    ID  year    number
0   A   2017    1
0   A   2018    0
1   A   2019    1
2   B   2017    1
3   B   2018    1
4   C   2016    1
2   C   2017    0
2   C   2018    0
5   C   2019    1

Upvotes: 6

Sebastien D

Reputation: 4482

Here is an approach:

letter_keys = df.ID.unique()
data = df.values
missing_records = []
for letter in letter_keys:
    print(letter)
    # years already present for this ID
    years = [x[1] for x in data if x[0] == letter]
    min_year = min(years)
    max_year = max(years)
    current_year = min_year
    # walk the range and record any year that is absent
    while current_year < max_year:
        if current_year not in years:
            missing_records.append([letter, current_year, 0])
            print('missing', current_year)
        current_year += 1

# DataFrame.append was removed in pandas 2.0, so concatenate instead
new_df = pd.concat([df, pd.DataFrame(missing_records, columns=df.columns)]).sort_values(['ID', 'year'])

Output

| ID   |   year |   number |
|:-----|-------:|---------:|
| A    |   2017 |        1 |
| A    |   2018 |        0 |
| A    |   2019 |        1 |
| B    |   2017 |        1 |
| B    |   2018 |        1 |
| C    |   2016 |        1 |
| C    |   2017 |        0 |
| C    |   2018 |        0 |
| C    |   2019 |        1 |

Upvotes: 4

dokteurwho

Reputation: 371

This would work, but it creates an entry for every ID/year combination in the data, e.g. a '2019' entry for 'B' (a way to drop those extra rows is sketched after the output):

df.pivot(index='ID', columns='year', values='number').fillna(0).stack().to_frame('number')

returns:

    number
ID  year    
A   2016    0.0
2017    1.0
2018    0.0
2019    1.0
B   2016    0.0
2017    1.0
2018    1.0
2019    0.0
C   2016    1.0
2017    0.0
2018    0.0
2019    1.0
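
If those extra rows are unwanted, one rough way to drop them (not from the original answer) is to keep only the years between each ID's observed min and max:

out = (df.pivot(index='ID', columns='year', values='number')
         .fillna(0).stack().rename('number').reset_index())

# keep only years between each ID's first and last observed year
bounds = df.groupby('ID')['year'].agg(['min', 'max'])
out = out[out['year'].between(out['ID'].map(bounds['min']),
                              out['ID'].map(bounds['max']))].reset_index(drop=True)

Note that number comes back as float after fillna; cast it with .astype(int) if needed.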

Upvotes: 1
