I am trying to create a column from two other columns in a DataFrame.
Consider the 3-column data frame:
import numpy as np
import pandas as pd
random_list_1 = np.random.randint(1, 10, 5)
random_list_2 = np.random.randint(1, 10, 5)
random_list_3 = np.random.randint(1, 10, 5)
df = pd.DataFrame({"p": random_list_1, "q": random_list_2, "r": random_list_3})
I create a new column from "p" and "q" with a function that is passed to apply.
As a simple example:
def operate(row):
    return [row['p'], row['q']]
Here,
df['s'] = df.apply(operate, axis = 1)
evaluates correctly and creates a column "s".
The issue appears when the data frame has as many columns as the list returned by operate has elements. For instance, with
df2 = pd.DataFrame({"p": random_list_1, "q": random_list_2})
evaluating this:
df2['s'] = df2.apply(operate, axis = 1)
raises a ValueError:
ValueError: Wrong number of items passed 2, placement implies 1
What is happening?
As a workaround, I could make operate return tuples (which does not raise an exception) and then convert them to lists, but for performance's sake I would prefer to get the lists in a single pass over the DataFrame.
Is there a way to achieve this?
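For reference, the tuple workaround described above might look like the sketch below. The name operate_tuple is hypothetical and only used for illustration. It needs two passes, one to build the tuples and one to convert them, which is the cost I would like to avoid.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"p": np.random.randint(1, 10, 5),
                    "q": np.random.randint(1, 10, 5)})

def operate_tuple(row):
    # Hypothetical variant of operate that returns a tuple instead of a list
    return (row["p"], row["q"])

# First pass: apply yields a Series whose elements are tuples
pairs = df2.apply(operate_tuple, axis=1)

# Second pass: convert every tuple to a list before storing it
df2["s"] = [list(t) for t in pairs]
```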
Upvotes: 1
Views: 67
Reputation: 13437
This works for me in both cases:
df["s"] = list(np.column_stack((df.p.values,df.q.values)))
Working with vectorized functions is better than using apply. In this case the speed boost is 3x. See the documentation.
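Note that list(np.column_stack(...)) stores small NumPy arrays in the new column rather than Python lists. If real lists are needed, a minimal sketch (assuming the same "p" and "q" columns as in the question) is to call .tolist() on the stacked array instead:

```python
import numpy as np
import pandas as pd

p = np.random.randint(1, 10, 5)
q = np.random.randint(1, 10, 5)
r = np.random.randint(1, 10, 5)

df = pd.DataFrame({"p": p, "q": q, "r": r})    # three columns
df2 = pd.DataFrame({"p": p, "q": q})           # two columns

for frame in (df, df2):
    # Stack the two columns side by side, then convert the 2-D array
    # to a list of [p, q] lists, one per row, without using apply
    stacked = np.column_stack((frame["p"].values, frame["q"].values))
    frame["s"] = stacked.tolist()
```

Because the column is built from a plain list with one element per row, the number of columns in the frame plays no role here.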
Anyway, I found your question interesting and I'd like to know why this is happening.
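One likely explanation, assuming a pandas version older than 0.23: when the function passed to apply with axis=1 returned a list whose length happened to match the number of columns, the result was expanded into a DataFrame instead of a Series of lists, and assigning a two-column DataFrame to the single column "s" fails. The sketch below assumes pandas 0.23 or later, where the result_type parameter of DataFrame.apply makes both behaviours explicit; it reproduces the expanded shape on purpose and then requests the Series form.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"p": np.random.randint(1, 10, 5),
                    "q": np.random.randint(1, 10, 5)})

def operate(row):
    return [row["p"], row["q"]]

# "expand" turns each returned list into columns, giving a DataFrame.
# This mirrors what older pandas did implicitly when the lengths matched.
expanded = df2.apply(operate, axis=1, result_type="expand")

# "reduce" asks for a Series whose elements are the returned lists,
# which can be assigned to a single column
as_lists = df2.apply(operate, axis=1, result_type="reduce")
df2["s"] = as_lists
```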
Upvotes: 0