Reputation: 95
I have some nan cells in the year column. I thought it'll be better to set the Mode of the year column grouped-By Industry type.
df
ID Name Industry Year Employees Expenses Profit
1 E-Zim Health 2019 320 1,130,700 8553827
2 Daltfase Software NaN 78 804,035 13212508
3 Hotlane Government 2012 87 1,044,375 8701897
4 Latho Health NaN 103 4,631,808 10727561
5 Lambam IT Services 2015 210 4,374,841 4193069
6 Quozap Health 2008 21 4,626,275 8179177
7 Tampware Health 2008 13 2,127,984 3259485
For mode values, i did:
df_mode_year = df.groupby('Industry')['Year'].apply(lambda x: x.mode().iloc[0])
df_mode_year
Industry
Government 2012
Health 2008
IT Services 2015
Software
And then to modify my df, i tried
But none of these two are affecting the final df .
Expected Output:
df
ID Name Industry Year Employees Expenses Profit
1 E-Zim Health 2019 320 1,130,700 8553827
2 Daltfase Software NaN 78 804,035 13212508
3 Hotlane Government 2012 87 1,044,375 8701897
4 Latho Health 2008 103 4,631,808 10727561
5 Lambam IT Services 2012 210 4,374,841 4193069
6 Quozap Health 2008 21 4,626,275 8179177
7 Tampware Health 2008 13 2,127,984 3259485
What am i doing wrong? Thanks a lot.
Upvotes: 0
Views: 77
Reputation: 5183
You can use .transform()
instead of .apply()
, which will return a series of the same length as df, with the mode of each group
mode = df.groupby('Industry').Year
mode = mode.transform(lambda x: x.mode().squeeze())
df.update(mode, overwrite=False)
.update()
operations are always done inplace and return None
Edit
The code in the first section works fine if all Industries have at least one year, but it seems .transform(lambda x: x.mode().squeeze())
raises an error when it encounters a group with only NaN
as values. If this is a possibility in the data, then the easiest is to take your approach to summarize mode by Industry:
mode = df.groupby('Industry').Year.apply(lambda x: x.mode().squeeze())
The problem now is that .squeeze()
returns an empty Series and .iloc[0]
(as in your original code) raises an exception.
>>> print(mode.apply(type))
Industry
Government <class 'numpy.float64'>
Health <class 'numpy.float64'>
IT Services <class 'numpy.float64'>
Software <class 'pandas.core.series.Series'>
Name: Year, dtype: object
>>> mode = df.groupby('Industry').Year.apply(lambda x: x.mode().iloc[0])
Traceback ...
IndexError: single positional indexer is out-of-bounds
So now we need to fix that Software value. Since it is an iterable, it cannot be replaced, but it can be iterated upon:
mode = mode.explode()
now mode is just like we need it
>>> print(mode)
Industry
Government 2012
Health 2008
IT Services 2015
Software NaN
Name: Year, dtype: object
And for the final trick of the evening:
df = df.merge(mode, on='Industry') # default suffixes: ['_x', '_y']
mode = df.pop('Year_y').rename('Year')
df.rename({'Year_x': 'Year'}, axis=1, inplace=True)
df.update(mode, overwrite=False)
Final Output
ID Name Industry Year Employees Expenses Profit
0 1 E-Zim Health 2019 320 1,130,700 8553827
1 4 Latho Health 2008 103 4,631,808 10727561
2 6 Quozap Health 2008 21 4,626,275 8179177
3 7 Tampware Health 2008 13 2,127,984 3259485
4 2 Daltfase Software NaN 78 804,035 13212508
5 3 Hotlane Government 2012 87 1,044,375 8701897
6 5 Lambam IT Services 2015 210 4,374,841 4193069
Notice how Software remains NaN
and health updated only NaN
s
Upvotes: 1
Reputation: 1
I would try pivot and melt. (You may need to convert df_mode_year to a map first)
df1 = df.pivot(index = 'ID', columns = 'Industry', values = 'Year').reset_index(drop = True)
df1 = df1.fillna(df_mode_year)
df1 = df1.melt(id_vars = 'ID', var_name='Industry', value_name='Year')
You can then concat this df1 with other columns from df to get the desired table.
Upvotes: 0