Reputation: 4249
I'm trying to replace some NaN values in my data with an empty list []. However, the list ends up represented as a str, which doesn't let me properly apply the len() function. Is there any way to replace a NaN value with an actual empty list in pandas?
In [28]: d = pd.DataFrame({'x' : [[1,2,3], [1,2], np.NaN, np.NaN], 'y' : [1,2,3,4]})

In [29]: d
Out[29]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2        NaN  3
3        NaN  4

In [32]: d.x.replace(np.NaN, '[]', inplace=True)

In [33]: d
Out[33]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [34]: d.x.apply(len)
Out[34]:
0    3
1    2
2    2
3    2
Name: x, dtype: int64
Upvotes: 39
Views: 34171
Reputation: 3906
import pandas as pd
import numpy as np

data = {'column1': [[1, 2], [2, 3], np.nan, [4, 5], np.nan],
        'column2': [np.nan, "Hi", "Hello", np.nan, "H"]}
df = pd.DataFrame(data)

def replace_none_with_empty_list(x):
    # Replace a missing value with an empty list, leave everything else as-is
    if x is np.nan:
        return []
    else:
        return x

df = df.applymap(replace_none_with_empty_list)
print(df)
Wherever there is a NaN, this replaces it with an empty list; otherwise it returns the same value.
  column1 column2
0  [1, 2]      []
1  [2, 3]      Hi
2      []   Hello
3  [4, 5]      []
4      []       H
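A version note that isn't part of the original answer: DataFrame.applymap was deprecated in pandas 2.1 in favor of the element-wise DataFrame.map. A minimal sketch of the same replacement, assuming pandas >= 2.1:

import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': [[1, 2], [2, 3], np.nan, [4, 5], np.nan],
                   'column2': [np.nan, "Hi", "Hello", np.nan, "H"]})

# DataFrame.map is the pandas >= 2.1 name for the element-wise applymap
df = df.map(lambda x: [] if x is np.nan else x)
print(df)

The output is the same as above; only the method name changes.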
Upvotes: 0
Reputation: 129
To extend the accepted answer: apply calls can be particularly expensive, and the same task can be accomplished without them by constructing a numpy array of empty lists from scratch and assigning it directly.
isna = df['x'].isna()
df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values
A quick timing comparison:
def empty_assign_1(s):
    s[s.isna()].apply(lambda x: [])

def empty_assign_2(s):
    [[]] * s.isna().sum()
series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))
%timeit empty_assign_1(series)
>>> 61 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit empty_assign_2(series)
>>> 2.17 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Nearly 30 times faster!
EDIT: Fixed a bug pointed out by @valentin
You have to be somewhat careful with data types when performing assignment in this case. In the example above the test series is float; adding [] elements, however, coerces the entire series to object. Pandas will handle that for you if you do something like
idx = series.isna()
series[idx] = series[idx].apply(lambda x: [])
because the output of apply is itself a Series. You can test the performance including the assignment overhead like so (I've added a string value so the series will be object dtype; you could instead use a number as the replacement value rather than an empty list to avoid coercion):
def empty_assign_1(s):
    idx = s.isna()
    s[idx] = s[idx].apply(lambda x: [])

def empty_assign_2(s):
    idx = s.isna()
    s.loc[idx] = [[]] * idx.sum()
series = pd.Series(np.random.choice([1, 2, np.nan, '2'], 1000000))
%timeit empty_assign_1(series.copy())
>>> 45.1 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit empty_assign_2(series.copy())
>>> 24 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
About 4 ms of that is the cost of the copy; the speedup drops from nearly 30x to about 2x, but it's still pretty great.
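One caveat the answer doesn't mention: [[]] * n repeats a reference to the same list object n times, so appending to one filled cell would show up in every other one. If the cells may be mutated later, a comprehension that builds fresh lists avoids this. A minimal sketch using the same assignment pattern as above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [[1, 2, 3], [1, 2], np.nan, np.nan], 'y': [1, 2, 3, 4]})
isna = df['x'].isna()

# The comprehension creates a distinct list for each NaN slot, unlike [[]] * n
df.loc[isna, 'x'] = pd.Series([[] for _ in range(isna.sum())]).values

df.loc[2, 'x'].append(99)   # mutate one of the filled cells...
print(df['x'].tolist())     # ...row 3 is still [], because each list is independent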
Upvotes: 12
Reputation: 1466
You can also use a list comprehension for this:
d['x'] = [ [] if x is np.NaN else x for x in d['x'] ]
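A caution that isn't in the original answer: x is np.NaN is an identity check, so it only matches the np.nan singleton; NaNs produced elsewhere (e.g. float('nan') or the result of a numpy computation) are different objects and would slip through (and the np.NaN alias itself was removed in NumPy 2.0). A value-based test is a safer drop-in, sketched here assuming the column only holds lists or missing scalars:

# pd.isna catches any float NaN (not just the np.nan singleton); the
# isscalar guard skips the list entries, where pd.isna would otherwise
# return an element-wise array.
d['x'] = [[] if (np.isscalar(x) and pd.isna(x)) else x for x in d['x']]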
Upvotes: 9
Reputation: 394071
This works using isnull and loc to mask the series:

In [90]:
d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
d
Out[90]:
0    [1, 2, 3]
1       [1, 2]
2           []
3           []
dtype: object

In [91]:
d.apply(len)
Out[91]:
0    3
1    2
2    0
3    0
dtype: int64
You have to do this using apply so that the list object is not interpreted as an array to assign back to the df, which would otherwise try to align its shape with the original series.
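To illustrate that point (my sketch, not part of the original answer): assigning a bare list directly is treated as an array-like to align against the masked rows, so it typically fails with a length-mismatch ValueError rather than storing an empty list in each cell.

# [] is interpreted as a length-0 array-like to broadcast over the two
# masked rows, so pandas raises a length-mismatch error instead of
# filling each cell with an empty list.
d.loc[d['x'].isnull(), 'x'] = []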
EDIT
Using your updated sample, the following works:

In [100]:
d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
d
Out[100]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [102]:
d['x'].apply(len)
Out[102]:
0    3
1    2
2    0
3    0
Name: x, dtype: int64
Upvotes: 45