Reputation: 1052
Dataframe with 3 columns:
FLAG CLASS CATEGORY
yes 'Sci' 'Alpha'
yes 'Sci' 'undefined'
yes 'math' 'Beta'
yes 'math' 'undefined'
yes 'eng' 'Gamma'
yes 'math' 'Beta'
yes 'eng' 'Gamma'
yes 'eng' 'Omega'
yes 'eng' 'Omega'
yes 'eng' 'undefined'
yes 'Geog' 'Lambda'
yes 'Art' 'undefined'
yes 'Art' 'undefined'
yes 'Art' 'undefined'
I want to fill the 'undefined' values in the CATEGORY column with the other category value (if any) that the class has. E.g. the 'Sci' class will fill its 'undefined' category with 'Alpha', and the 'math' class will fill its 'undefined' category with 'Beta'.
If a class has two or more categories to consider, leave it as is. E.g. the English class 'eng' has two categories, 'Gamma' and 'Omega', so the 'undefined' category for 'eng' stays 'undefined'.
If all the categories for a class are 'undefined', leave them as 'undefined'.
Result
FLAG CLASS CATEGORY
yes 'Sci' 'Alpha'
yes 'Sci' 'Alpha'
yes 'math' 'Beta'
yes 'math' 'Beta'
yes 'eng' 'Gamma'
yes 'math' 'Beta'
yes 'eng' 'Gamma'
yes 'eng' 'Omega'
yes 'eng' 'Omega'
yes 'eng' 'undefined'
yes 'Geog' 'Lambda'
yes 'Art' 'undefined'
yes 'Art' 'undefined'
yes 'Art' 'undefined'
It needs to generalize: I have many classes in the CLASS column and cannot afford to hard-code 'Sci' or 'eng'.
I have been trying to do this with multiple np.where calls but have had no luck.
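For anyone who wants to reproduce this, the sample frame could be built roughly as follows. This is only a sketch: it assumes the values are plain strings without the surrounding quotes, and that the frame is named df.

```python
import pandas as pd

# Sample data from the question; the quotes around the values are dropped (an assumption).
df = pd.DataFrame({
    "FLAG": ["yes"] * 14,
    "CLASS": ["Sci", "Sci", "math", "math", "eng", "math", "eng",
              "eng", "eng", "eng", "Geog", "Art", "Art", "Art"],
    "CATEGORY": ["Alpha", "undefined", "Beta", "undefined", "Gamma", "Beta", "Gamma",
                 "Omega", "Omega", "undefined", "Lambda",
                 "undefined", "undefined", "undefined"],
})
print(df)
```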
Upvotes: 0
Views: 117
Reputation: 25239
Edit:
I am adding another solution that uses isin to select the valid classes, covering both their not-undefined and undefined rows, and then updates exactly that slice of df.
Steps:
Create m as the series of CLASS values whose not-undefined CATEGORY values are unique (a single distinct value). Use isin to select the qualifying rows and where to turn 'undefined' into NaN. Finally, group these rows by CLASS, apply ffill and bfill per group to fill the NaN, and assign the result back to df.
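m = df.query('CATEGORY != "undefined"').drop_duplicates().CLASS.drop_duplicates(keep=False)
sel = df.CLASS.isin(m)
# Fill CATEGORY only, per CLASS; recent pandas drops the grouping column from groupby(...).ffill()
df.loc[sel, 'CATEGORY'] = df.loc[sel, 'CATEGORY'].where(lambda x: x != 'undefined').groupby(df.loc[sel, 'CLASS']).transform(lambda x: x.ffill().bfill())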
This solution looks cleaner, but I don't know whether it is slower than the original solution, since it uses groupby.
Original:
My solution fills the 'undefined' rows by mapping each CLASS to its unique 'not undefined' value:
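# CLASS values that have exactly one distinct category other than 'undefined'
m = df.query('CATEGORY != "undefined"').drop_duplicates().CLASS.drop_duplicates(keep=False)
# Map each 'undefined' row to its class's single category (NaN if the class has none or several)
t = df.query('CATEGORY == "undefined"').CLASS.map(df.loc[m.index].set_index('CLASS').CATEGORY)
# Assign back explicitly; chained df['CATEGORY'].update(t) may not modify df under copy-on-write
df['CATEGORY'] = t.combine_first(df['CATEGORY'])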
Out[553]:
FLAG CLASS CATEGORY
0 yes Sci Alpha
1 yes Sci Alpha
2 yes math Beta
3 yes math Beta
4 yes eng Gamma
5 yes math Beta
6 yes eng Gamma
7 yes eng Omega
8 yes eng Omega
9 yes eng undefined
10 yes Geog Lambda
11 yes Art undefined
12 yes Art undefined
13 yes Art undefined
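The same mapping idea can also be wrapped in a small reusable function. The sketch below is not the code above but a variant of it: the name fill_single_category and the placeholder parameter are hypothetical, and it counts distinct categories with groupby(...).nunique() instead of drop_duplicates(keep=False). It returns a new frame rather than modifying the input.

```python
import pandas as pd

def fill_single_category(df, placeholder="undefined"):
    """Fill placeholder CATEGORY values for classes with exactly one known category."""
    known = df[df["CATEGORY"] != placeholder]
    # Number of distinct known categories per class
    counts = known.groupby("CLASS")["CATEGORY"].nunique()
    single = counts[counts == 1].index
    # Lookup table: CLASS -> its only known CATEGORY
    lookup = (known[known["CLASS"].isin(single)]
              .drop_duplicates("CLASS")
              .set_index("CLASS")["CATEGORY"])
    out = df.copy()
    mapped = out["CLASS"].map(lookup)
    # Only touch placeholder rows whose class has a usable lookup value
    to_fill = (out["CATEGORY"] == placeholder) & mapped.notna()
    out.loc[to_fill, "CATEGORY"] = mapped[to_fill]
    return out

# Small illustrative frame (a subset of the question's data)
df = pd.DataFrame({
    "FLAG": ["yes"] * 6,
    "CLASS": ["Sci", "Sci", "eng", "eng", "eng", "Art"],
    "CATEGORY": ["Alpha", "undefined", "Gamma", "Omega", "undefined", "undefined"],
})
result = fill_single_category(df)
print(result["CATEGORY"].tolist())
# ['Alpha', 'Alpha', 'Gamma', 'Omega', 'undefined', 'undefined']
```

No class name is hard-coded, so this should carry over to any number of classes.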
Upvotes: 1
Reputation: 323226
I would use ffill and bfill within groupby:
s = df.CATEGORY.mask(df.CATEGORY.eq('undefined'))
s2 = s.groupby(df['CLASS']).transform('nunique')
# transform keeps the original index, so the result aligns with df on assignment
df.loc[s2.eq(1) & s.isnull(), 'CATEGORY'] = s.groupby(df.CLASS).transform(lambda x: x.ffill().bfill())
df
Out[388]:
FLAG CLASS CATEGORY
0 yes Sci Alpha
1 yes Sci Alpha
2 yes math Beta
3 yes math Beta
4 yes eng Gamma
5 yes math Beta
6 yes eng Gamma
7 yes eng Omega
8 yes eng Omega
9 yes eng undefined
10 yes Geog Lambda
11 yes Art undefined
12 yes Art undefined
13 yes Art undefined
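A possible variant of this approach, sketched on a small made-up frame, avoids the Python-level lambda by broadcasting the built-in 'first' aggregation, which skips missing values. The names n_known, first_known and fill are illustrative only. The rows are deliberately interleaved so that a leading 'undefined' in 'math' cannot pick up a value from 'Sci'.

```python
import pandas as pd

df = pd.DataFrame({
    "FLAG": ["yes"] * 6,
    "CLASS": ["math", "Sci", "math", "eng", "eng", "eng"],
    "CATEGORY": ["undefined", "Alpha", "Beta", "Gamma", "Omega", "undefined"],
})

# Treat 'undefined' as missing
s = df["CATEGORY"].mask(df["CATEGORY"].eq("undefined"))
grouped = s.groupby(df["CLASS"])
# Distinct known categories per class, broadcast to every row
n_known = grouped.transform("nunique")
# First non-missing category of each class, broadcast to every row
first_known = grouped.transform("first")
# Fill only missing rows whose class has exactly one known category
fill = s.isna() & n_known.eq(1)
df.loc[fill, "CATEGORY"] = first_known[fill]
print(df["CATEGORY"].tolist())
# ['Beta', 'Alpha', 'Beta', 'Gamma', 'Omega', 'undefined']
```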
Upvotes: 2
Reputation: 2032
Try the following, which fills with the mode only when a class has exactly one known category:
import numpy as np
s = df['CATEGORY'].replace('undefined', np.nan)
df['CATEGORY'] = s.groupby(df['CLASS']).transform(lambda x: x.fillna(x.mode()[0]) if x.nunique() == 1 else x).fillna('undefined')
Upvotes: 1
Reputation: 31993
You can do this with boolean indexing:
df.loc[(df['CLASS'] == 'Sci') & (df['CATEGORY'] == 'undefined'), 'CATEGORY'] = 'Alpha'
df.loc[(df['CLASS'] == 'math') & (df['CATEGORY'] == 'undefined'), 'CATEGORY'] = 'Beta'
Upvotes: 0