Reputation: 4149

Fill missing values based on condition in duplicated column

I have a pandas DataFrame with two columns, such as:

df = ID state
      255 NJ
      255 NaN
      266 CT
      266 CT
      277 NaN
      277 NY
      277 NaN
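
For reproducibility, the sample above can be built roughly as follows. This is a sketch that assumes the missing values are np.nan and that the states are plain strings:

```python
import numpy as np
import pandas as pd

# Sample data from the question; np.nan marks the missing states (an assumption).
df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
})
print(df)
```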

I want to fill the missing values in state with the state that appears for the same ID.

Desired output is the following:

df = ID state
      255 NJ
      255 NJ
      266 CT
      266 CT
      277 NY
      277 NY
      277 NY

How can I do this? I have tried building masks with numpy.where, but without success; I get errors such as "operands could not be broadcast together with shapes (26229,) (2053,) ()", among others. Any help is appreciated.

Upvotes: 3

Answers (4)

jezrael

Reputation: 862581

Use DataFrame.sort_values with GroupBy.ffill:

df['state'] = df.sort_values('state').groupby('ID')['state'].ffill()
print(df)
    ID state
0  255    NJ
1  255    NJ
2  266    CT
3  266    CT
4  277    NY
5  277    NY
6  277    NY
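
The sort matters because sort_values places NaN last, so within each ID the known state comes before the gaps and a forward fill can reach them. A small sketch comparing the two, where the sample frame is rebuilt under the assumption that the missing values are np.nan:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
})

# Without sorting, the first row of ID 277 has nothing before it to copy from.
no_sort = df.groupby("ID")["state"].ffill()

# With sorting, NaN rows come after the known state inside every group.
with_sort = df.sort_values("state").groupby("ID")["state"].ffill()

print(no_sort.isna().sum())     # rows still missing without the sort
print(with_sort.sort_index())   # back in the original row order
```

Assigning with_sort back to df['state'] aligns on the index, so the original row order of df is unchanged.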

To fill multiple columns, use:

cols = ['state', ...]
df.loc[:, cols] = df.sort_values('state').groupby('ID')[cols].ffill()
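
Note that sorting by state only guarantees that the NaN values of state come last; the other columns may still have gaps before their first known value. A minimal sketch of one alternative that does not rely on sort order, using a forward fill followed by a backward fill within each ID. The city column and its values are hypothetical, added only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
    # Hypothetical second column, not part of the original question.
    "city": [np.nan, "Newark", "Hartford", np.nan, np.nan, np.nan, "Albany"],
})

cols = ["state", "city"]

# Forward fill inside each ID, then backward fill what is still missing.
filled = df.groupby("ID")[cols].ffill()
filled = filled.groupby(df["ID"]).bfill()

df[cols] = filled
print(df)
```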

Upvotes: 2

Quang Hoang

Reputation: 150735

IIUC, each ID has a unique state, so:

df['state'] = df.groupby('ID')['state'].transform('first')

output:

    ID state
0  255    NJ
1  255    NJ
2  266    CT
3  266    CT
4  277    NY
5  277    NY
6  277    NY
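
This works because the groupby 'first' aggregation returns the first non-null value in each group. A sketch that also checks the "one state per ID" assumption before filling; the sample frame is rebuilt here assuming the missing values are np.nan:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
})

# Count distinct non-null states per ID; more than one would make 'first' ambiguous.
n_states = df.groupby("ID")["state"].nunique()
if (n_states > 1).any():
    raise ValueError("some IDs map to more than one state")

# Broadcast the first non-null state of each ID to every row of that ID.
df["state"] = df.groupby("ID")["state"].transform("first")
print(df)
```

If an ID has no known state at all, its rows stay NaN.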

Upvotes: 2

BENY

Reputation: 323226

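Using groupby with ffill + bfill:

# transform returns a result indexed like df, so it aligns on assignment
# (in newer pandas, apply may prepend the group key to the index)
df['state'] = df.groupby('ID')['state'].transform(lambda x: x.ffill().bfill())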
df
Out[907]: 
    ID state
0  255    NJ
1  255    NJ
2  266    CT
3  266    CT
4  277    NY
5  277    NY
6  277    NY

Upvotes: 1

tawab_shakeel

Reputation: 3739

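First sort with sort_values, then forward-fill within each ID using groupby:

# NaN sorts last within each ID; note that this reorders df in place
df.sort_values(by=['ID', 'state'], inplace=True)
# select the column explicitly so the result is a Series, not a DataFrame
df['state'] = df.groupby('ID')['state'].ffill()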

Upvotes: 1
