Reputation: 7179
Similar to the question "How to add an empty column to a dataframe?", I am interested in knowing the best way to add a column of empty lists to a DataFrame.
What I am trying to do is initialize a column and then, as I iterate over the rows to process some of them, replace the initialized value in this new column with a filled list.
For example, if below is my initial DataFrame:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})  # Sample DataFrame
>>> df
a b
0 1 5
1 2 6
2 3 7
Then I want to ultimately end up with something like this, where each row has been processed separately (sample results shown):
>>> df
a b c
0 1 5 [5, 6]
1 2 6 [9, 0]
2 3 7 [1, 2, 3]
Of course, if I try to initialize like df['e'] = [] as I would with any other constant, pandas thinks I am trying to add a sequence of items with length 0, and hence fails.
If I try initializing a new column as None or NaN, I run into the following issues when trying to assign a list to a location.
df['d'] = None
>>> df
a b d
0 1 5 None
1 2 6 None
2 3 7 None
Issue 1 (it would be perfect if I can get this approach to work! Maybe something trivial I am missing):
>>> df.loc[0,'d'] = [1,3]
...
ValueError: Must have equal len keys and value when setting with an iterable
Issue 2 (this one works, but not without a warning because it is not guaranteed to work as intended):
>>> df['d'][0] = [1,3]
C:\Python27\Scripts\ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Hence I resort to initializing with empty lists and extending them as needed. There are a couple of methods I can think of to initialize this way, but is there a more straightforward way?
Method 1:
df['empty_lists1'] = [list() for x in range(len(df.index))]
>>> df
a b empty_lists1
0 1 5 []
1 2 6 []
2 3 7 []
Method 2:
df['empty_lists2'] = df.apply(lambda x: [], axis=1)
>>> df
a b empty_lists1 empty_lists2
0 1 5 [] []
1 2 6 [] []
2 3 7 [] []
Summary of questions:
Is there any minor syntax change that can be addressed in Issue 1 that can allow a list to be assigned to a None/NaN initialized field?
If not, then what is the best way to initialize a new column with empty lists?
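On the first question, one approach that may work is the scalar accessor .at instead of .loc. The following is a minimal sketch, assuming the column was initialized with None so that it has object dtype; behaviour may differ across pandas versions and for a column initialized with NaN (float dtype).

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})
df['d'] = None  # object-dtype column, so a cell can hold any Python object

# .at addresses exactly one cell, so the list is stored as a single value
# rather than being unpacked across several cells
df.at[0, 'd'] = [1, 3]
```

The other rows keep their initial None value and can be filled in the same way as they are processed.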
Upvotes: 63
Views: 71924
Reputation: 31
Here is a tricky but straightforward one:
import ast
df['empty_lists1'] = "[]"
df["empty_lists1"] = df["empty_lists1"].apply(ast.literal_eval)  # parse each "[]" string into a new list
Upvotes: 0
Reputation: 6114
EDIT: the commenters caught the bug in my answer
s = pd.Series([[]] * 3)
s.iloc[0].append(1) #adding an item only to the first element
>>> s # unintended consequences:
0 [1]
1 [1]
2 [1]
So, the correct solution is
s = pd.Series([[] for i in range(3)])
s.iloc[0].append(1)
>>> s
0 [1]
1 []
2 []
OLD:
I timed all three methods in the accepted answer; the fastest one took 216 ms on my machine. However, this took only 28 ms:
df['empty4'] = [[]] * len(df)
Note: Similarly, df['e5'] = [set()] * len(df) also took 28 ms.
Upvotes: 12
Reputation: 402523
map and apply
Obligatory disclaimer: avoid using lists in pandas columns where possible. List columns are slow to work with because they hold Python objects, which are inherently hard to vectorize.
With that out of the way, here are the canonical methods of introducing a column of empty lists:
# List comprehension
df['c'] = [[] for _ in range(df.shape[0])]
df
a b c
0 1 5 []
1 2 6 []
2 3 7 []
There are also these shorthands involving apply and map:
from collections import defaultdict
# map any column with defaultdict
df['c'] = df.iloc[:,0].map(defaultdict(list))
# same as,
df['c'] = df.iloc[:,0].map(lambda _: [])
# apply with defaultdict
df['c'] = df.apply(defaultdict(list), axis=1)
# same as,
df['c'] = df.apply(lambda _: [], axis=1)
df
a b c
0 1 5 []
1 2 6 []
2 3 7 []
Some folks believe multiplying an empty list is the way to go. Unfortunately, this is wrong and will usually lead to hard-to-debug issues. Here's a minimal example:
# WRONG
df['c'] = [[]] * len(df)
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df
a b c
0 1 5 [abc, def]
1 2 6 [abc, def]
2 3 7 [abc, def]
# RIGHT
df['c'] = [[] for _ in range(df.shape[0])]
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df
a b c
0 1 5 [abc]
1 2 6 [def]
2 3 7 []
In the first case, a single empty list is created and its reference is replicated across all the rows, so you see updates to one reflected to all of them. In the latter case each row is assigned its own empty list, so this is not a concern.
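The same aliasing can be seen without pandas at all. This is a small plain-Python sketch (the variable names shared and separate are just for illustration) that uses the is operator to check whether two rows hold the same list object:

```python
# Plain-Python illustration; no pandas required
shared = [[]] * 3                   # three references to one list object
separate = [[] for _ in range(3)]   # three distinct list objects

shared[0].append('abc')
separate[0].append('abc')

print(shared)     # [['abc'], ['abc'], ['abc']]
print(separate)   # [['abc'], [], []]
print(shared[0] is shared[1])       # True
print(separate[0] is separate[1])   # False
```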
Upvotes: 5
Reputation: 12108
One more way is to use np.empty:
df['empty_list'] = np.empty((len(df), 0)).tolist()
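As a quick check that this does not fall into the shared-reference trap of [[]] * n, here is a minimal sketch (n is just a stand-in for len(df)) showing that an array of shape (n, 0) converts to n separate empty lists:

```python
import numpy as np

n = 3  # stand-in for len(df)
rows = np.empty((n, 0)).tolist()  # an (n, 0) array converts to n empty lists

rows[0].append(1)  # mutate only the first list
print(rows)        # [[1], [], []]
```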
You could also knock off .index in your "Method 1" when trying to find len of df.
df['empty_list'] = [[] for _ in range(len(df))]
Turns out, np.empty is faster...
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame(np.random.rand(1000000, 5))
In [3]: timeit df['empty1'] = np.empty((len(df), 0)).tolist()
10 loops, best of 3: 127 ms per loop
In [4]: timeit df['empty2'] = [[] for _ in range(len(df))]
10 loops, best of 3: 193 ms per loop
In [5]: timeit df['empty3'] = df.apply(lambda x: [], axis=1)
1 loops, best of 3: 5.89 s per loop
Upvotes: 90