ysearka

Reputation: 3855

set list as value in a column of a pandas dataframe

Say I have a dataframe df and want to create a new column filled with 0. I use:

df['new_col'] = 0

So far, no problem. But if the value I want to use is a list, it doesn't work:

df['new_col'] = my_list

ValueError: Length of values does not match length of index

I understand why this doesn't work (pandas tries to assign one element of the list to each cell of the column), but how can I avoid this behavior? To be clear, I would like every cell of the new column to contain the same predefined list.

Note: I also tried df.assign(new_col=my_list), which fails with the same error.

Upvotes: 33

Views: 56639

Answers (3)

lisovskey

Reputation: 133

You can use DataFrame.apply:

In [1]:
import pandas as pd
df = pd.DataFrame([1, 2, 3], columns=['numbers'])
my_list = ['foo', 'bar']
df['lists'] = df.apply(lambda _: my_list, axis=1)
df

Out[1]:
   numbers       lists
0        1  [foo, bar]
1        2  [foo, bar]
2        3  [foo, bar]

Be aware that my_list is mutable and that the same list object is shared by every row of the dataframe. To avoid that, you can make a copy for each row:

df['lists'] = df.apply(lambda _: my_list.copy(), axis=1)
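A minimal sketch of the difference, reusing the df and my_list from above. The names shared and copied are only illustrative; the identity checks assume a recent pandas version where apply with axis=1 returns a Series of the returned list objects:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3], columns=['numbers'])
my_list = ['foo', 'bar']

# Every row receives a reference to the very same list object.
shared = df.apply(lambda _: my_list, axis=1)

# Every row receives its own shallow copy of the list.
copied = df.apply(lambda _: my_list.copy(), axis=1)

shared_is_same = shared.iloc[0] is shared.iloc[1]
copied_is_same = copied.iloc[0] is copied.iloc[1]

# Appending to one copied cell leaves the other rows untouched.
copied.iloc[0].append('baz')
other_row = copied.iloc[1]
```

Note that .copy() is a shallow copy, so nested mutable items inside my_list would still be shared between rows.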

Upvotes: 2

Mr_and_Mrs_D

Reputation: 34016

Note that the accepted answer may lead to surprising behavior if you want to modify those lists:

import pandas as pd
df = pd.DataFrame([1, 2, 3], columns=['a'])
df['lists'] = [[]] * len(df)  # the same list object repeated len(df) times
df
   a lists
0  1    []
1  2    []
2  3    []
df.loc[df.a == 1, 'lists'][0].append('1')
df
   a lists
0  1   [1]
1  2   [1]
2  3   [1]
# oops

To avoid this you must initialize the lists column with a different list instance per row:

df['lists'] = [[] for r in range(len(df))] # note you can't use a generator
df.loc[df.a == 1, 'lists'][0].append('1')
df
   a lists
0  1   [1]
1  2    []
2  3    []

Don't be fooled by the display there: that 1 is still a string.

df.loc[df.a == 1, 'lists'][0]
['1']
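As a side note, here is a hedged sketch of the same per-row initialization that reaches the cell with DataFrame.at instead of chained indexing. The variable row_values is just an illustrative name, and this assumes the default integer index so that label 0 is the first row:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3], columns=['a'])

# One fresh list per row, so the cells do not share state.
df['lists'] = [[] for _ in range(len(df))]

# .at returns the object stored in a single cell; appending mutates only it.
df.at[0, 'lists'].append('1')

row_values = df['lists'].tolist()
```

Only the list in the first row changes, because each row holds a distinct list object.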

Upvotes: 14

EdChum

Reputation: 393943

You'd have to do:

df['new_col'] = [my_list] * len(df)

Example:

In [13]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5, 3), columns=list('abc'))
df

Out[13]:
          a         b         c
0 -0.010414  1.859791  0.184692
1 -0.818050 -0.287306 -1.390080
2 -0.054434  0.106212  1.542137
3 -0.226433  0.390355  0.437592
4 -0.204653 -2.388690  0.106218

In [17]:
df['b'] = [[234]] * len(df)
df

Out[17]:
          a      b         c
0 -0.010414  [234]  0.184692
1 -0.818050  [234] -1.390080
2 -0.054434  [234]  1.542137
3 -0.226433  [234]  0.437592
4 -0.204653  [234]  0.106218

Note that dataframes are optimised for scalar values. In my opinion, storing non-scalar values defeats the point: filtering, looking up, getting and setting all become problematic, to the point that it becomes a pain.
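To illustrate the kind of friction meant here, a small sketch with made-up data: filtering on the contents of list cells has no vectorised form, so it typically falls back to a Python-level function applied cell by cell. The names mask and matches are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df['b'] = [[234], [1], [234]]  # one list per row

# Membership has to be tested per cell with a plain Python function.
mask = df['b'].apply(lambda cell: 234 in cell)

matches = df.loc[mask, 'a'].tolist()
```

This works, but it runs a Python call per row, which is exactly the cost that scalar columns avoid.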

Upvotes: 28
