Learn

Reputation: 548

Python loop to check whether rows in a column are null, then replace

I am trying to get a non-null value for the last name: if LastName is null, use Middle instead. My attempt raises an error. How can I resolve this? P.S. I have 20 million rows.

dataframe:

FirstName   Middle  LastName
Tom          Ju     NaN
Kity         NaN    Rob

My attempt:

for row in df:
    if row['LastName'].isnull() == True:
        row['real_lastName'] =  row['Middle']
    else:
        row['real_lastName'] =  row['LastName'] 

I get the following error:

TypeError: string indices must be integers

Upvotes: 1

Views: 13220

Answers (2)

Joe

Reputation: 12417

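Another option (note that this concatenates the two columns, so it only gives the expected result when exactly one of them is non-null in each row):

df["real_lastName"] = df['Middle'].replace(np.nan, '') + df['LastName'].replace(np.nan, '')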

Upvotes: 0

jezrael

Reputation: 862691

Use numpy.where:

df['real_lastName'] = np.where(df['LastName'].isnull(), df['Middle'], df['LastName'])

print(df)
  FirstName Middle LastName real_lastName
0       Tom     Ju      NaN            Ju
1      Kity    NaN      Rob           Rob

Other possible solutions are fillna or combine_first:

df['real_lastName'] = df['LastName'].fillna(df['Middle'])

df['real_lastName'] = df['LastName'].combine_first(df['Middle'])
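For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the two-row sample from the question; the intermediate names (`via_where`, `via_fillna`, `via_combine`) are only for illustration. It checks that all three approaches agree on this sample before assigning the new column.

```python
import numpy as np
import pandas as pd

# Sample data from the question (assumed to match the real column names).
df = pd.DataFrame({
    "FirstName": ["Tom", "Kity"],
    "Middle": ["Ju", np.nan],
    "LastName": [np.nan, "Rob"],
})

# 1) numpy.where: take Middle where LastName is null, otherwise LastName.
via_where = pd.Series(
    np.where(df["LastName"].isnull(), df["Middle"], df["LastName"]),
    index=df.index,
)

# 2) fillna: fill the nulls in LastName with the aligned values from Middle.
via_fillna = df["LastName"].fillna(df["Middle"])

# 3) combine_first: keep LastName, falling back to Middle where it is null.
via_combine = df["LastName"].combine_first(df["Middle"])

# All three agree on this sample.
assert via_where.tolist() == via_fillna.tolist() == via_combine.tolist()

df["real_lastName"] = via_fillna
print(df)
```

All three are vectorized, so none of them needs a Python-level loop over the rows.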

Performance is similar:

#[200000 rows x 4 columns]
df = pd.concat([df] * 100000, ignore_index=True)
In [41]: %timeit df['real_lastName'] = np.where(df['LastName'].isnull(), df['Middle'], df['LastName'] )
13.3 ms ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [42]: %timeit df['real_lastName'] = df['LastName'].fillna(df['Middle'])
16.2 ms ± 58.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [43]: %timeit df['real_lastName'] = df['LastName'].combine_first(df['Middle'])
13 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
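As for the original TypeError: iterating over a DataFrame directly (for row in df) yields the column labels, not the rows, so row is a string and row['LastName'] tries to index a string with a string. A small sketch that reproduces this, again assuming the sample data from the question (the names `labels` and `error_message` are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "FirstName": ["Tom", "Kity"],
    "Middle": ["Ju", np.nan],
    "LastName": [np.nan, "Rob"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
labels = [item for item in df]
print(labels)

# Each item is a plain string, so indexing it with 'LastName' fails
# in the same way as the loop in the question.
try:
    labels[0]["LastName"]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Row-wise iteration is possible with df.iterrows() or df.itertuples(), but with 20 million rows the vectorized solutions above are likely to be much faster.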

Upvotes: 4
