Joe
Joe

Reputation: 95

Adding a mode value for nan's using groupBy on another column

I have some nan cells in the year column. I thought it'll be better to set the Mode of the year column grouped-By Industry type.

df
ID  Name     Industry    Year   Employees Expenses   Profit
1   E-Zim    Health      2019   320       1,130,700  8553827
2   Daltfase Software    NaN    78        804,035    13212508
3   Hotlane  Government  2012   87        1,044,375  8701897
4   Latho    Health      NaN    103       4,631,808  10727561
5   Lambam   IT Services 2015   210       4,374,841  4193069
6   Quozap   Health      2008   21        4,626,275  8179177
7   Tampware Health      2008   13        2,127,984  3259485

For mode values, i did:

df_mode_year = df.groupby('Industry')['Year'].apply(lambda x: x.mode().iloc[0])
df_mode_year

Industry
Government             2012
Health                 2008
IT Services            2015
Software

And then to modify my df, i tried

  1. df['Year'].fillna(df_mode_year)
  2. df['Year'] = df['Year'].fillna(df_mode_year[df['Industry']=='Health'])

But none of these two are affecting the final df .

Expected Output:

df
ID  Name     Industry    Year   Employees Expenses   Profit
1   E-Zim    Health      2019   320       1,130,700  8553827
2   Daltfase Software    NaN    78        804,035    13212508
3   Hotlane  Government  2012   87        1,044,375  8701897
4   Latho    Health      2008   103       4,631,808  10727561
5   Lambam   IT Services 2012   210       4,374,841  4193069
6   Quozap   Health      2008   21        4,626,275  8179177
7   Tampware Health      2008   13        2,127,984  3259485

What am i doing wrong? Thanks a lot.

Upvotes: 0

Views: 77

Answers (2)

RichieV
RichieV

Reputation: 5183

You can use .transform() instead of .apply(), which will return a series of the same length as df, with the mode of each group

mode = df.groupby('Industry').Year
mode = mode.transform(lambda x: x.mode().squeeze())
df.update(mode, overwrite=False)

.update() operations are always done inplace and return None


Edit

The code in the first section works fine if all Industries have at least one year, but it seems .transform(lambda x: x.mode().squeeze()) raises an error when it encounters a group with only NaN as values. If this is a possibility in the data, then the easiest is to take your approach to summarize mode by Industry:

mode = df.groupby('Industry').Year.apply(lambda x: x.mode().squeeze())

The problem now is that .squeeze() returns an empty Series and .iloc[0] (as in your original code) raises an exception.

>>> print(mode.apply(type))

Industry
Government                 <class 'numpy.float64'>
Health                     <class 'numpy.float64'>
IT Services                <class 'numpy.float64'>
Software       <class 'pandas.core.series.Series'>
Name: Year, dtype: object

>>> mode = df.groupby('Industry').Year.apply(lambda x: x.mode().iloc[0])
Traceback ...
IndexError: single positional indexer is out-of-bounds

So now we need to fix that Software value. Since it is an iterable, it cannot be replaced, but it can be iterated upon:

mode = mode.explode()

now mode is just like we need it

>>> print(mode)
Industry
Government     2012
Health         2008
IT Services    2015
Software        NaN
Name: Year, dtype: object

And for the final trick of the evening:

df = df.merge(mode, on='Industry') # default suffixes: ['_x', '_y']
mode = df.pop('Year_y').rename('Year')
df.rename({'Year_x': 'Year'}, axis=1, inplace=True)
df.update(mode, overwrite=False)

Final Output

   ID      Name     Industry  Year  Employees   Expenses    Profit
0   1     E-Zim       Health  2019        320  1,130,700   8553827
1   4     Latho       Health  2008        103  4,631,808  10727561
2   6    Quozap       Health  2008         21  4,626,275   8179177
3   7  Tampware       Health  2008         13  2,127,984   3259485
4   2  Daltfase     Software   NaN         78    804,035  13212508
5   3   Hotlane   Government  2012         87  1,044,375   8701897
6   5    Lambam  IT Services  2015        210  4,374,841   4193069

Notice how Software remains NaN and health updated only NaNs

Upvotes: 1

hyang23
hyang23

Reputation: 1

I would try pivot and melt. (You may need to convert df_mode_year to a map first)

df1 = df.pivot(index = 'ID', columns = 'Industry', values = 'Year').reset_index(drop = True)
df1 = df1.fillna(df_mode_year)
df1 = df1.melt(id_vars = 'ID', var_name='Industry', value_name='Year')

You can then concat this df1 with other columns from df to get the desired table.

Upvotes: 0

Related Questions