Pandas apply with list output gives ValueError once df contains timeseries

Question

I'm trying to write a function for apply that returns two values, because the two calculations are similar and time-consuming and I don't want to call apply twice. The MWE below is deliberately trivial, and I know there are easier ways to achieve what it does. My actual function is more complicated, but I already run into an error with this MWE.

So, I got this to work:

import numpy as np
import pandas as pd

def function(row):
    return [row.A, row.A/2]

df = pd.DataFrame({'A' : np.random.randn(8),
                'B' : np.random.randn(8)})
df[['D','E']] = df.apply(lambda row: function(row), axis=1).apply(pd.Series)

However, this does not:

df2 = pd.DataFrame({'A' : np.random.randn(8),
                'B' : pd.date_range('1/1/2011', periods=8, freq='H'),
              'C' : np.random.randn(8)})
df2[['D','E']] = df2.apply(lambda row: function(row), axis=1).apply(pd.Series)

Instead, it gives me ValueError: Shape of passed values is (8, 2), indices imply (8, 3)

I don't understand why changing the type of the B column would affect the outcome; it isn't even used in the applied function.

I guess I could avoid this issue in the example by temporarily excluding the date column. However, my real function will need to use the date.

Can someone explain why this example does not work? What changes when a timeseries column is included?

piRSquared · Accepted Answer

Have function return a pd.Series instead. Returning a list makes apply try to fit the list into the existing row. Returning a pd.Series tells pandas to treat the result as a new row with its own index, so each row expands into separate columns.

def function(row):
    return pd.Series([row.A, row.A/2])


df2 = pd.DataFrame({'A' : np.random.randn(8),
                    'B' : pd.date_range('1/1/2011', periods=8, freq='H'),
                    'C' : np.random.randn(8)})
df2[['D','E']] = df2.apply(function, axis=1)

df2
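If you would rather keep returning a plain list, another option is the result_type='expand' argument of DataFrame.apply. This is only a sketch, assuming a pandas version that supports that argument (0.23 or later, as far as I know). The daily date range and the name expanded are my own choices for illustration, not part of the question.

```python
import numpy as np
import pandas as pd

def function(row):
    # Still returns a plain list; both values come from one pass over the row.
    return [row.A, row.A / 2]

# Daily timestamps are an assumption for this sketch; the question uses hourly ones.
df2 = pd.DataFrame({'A': np.random.randn(8),
                    'B': pd.date_range('2011-01-01', periods=8, freq='D'),
                    'C': np.random.randn(8)})

# result_type='expand' asks apply to spread each returned list across columns.
expanded = df2.apply(function, axis=1, result_type='expand')
expanded.columns = ['D', 'E']  # name the new columns explicitly

# Attach the new columns to the original frame by index.
df2 = df2.join(expanded)
print(df2.head())
```

Joining on the index keeps the original columns untouched, so the date column stays available to the function.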

Attempt to explain

s = pd.Series([1, 2, 3])
s

0    1
1    2
2    3
dtype: int64

s.loc[:] = [4, 5, 6]
s

0    4
1    5
2    6
dtype: int64

s.loc[:] = [7, 8]

ValueError: cannot set using a slice indexer with a different length than the value
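Carrying the analogy back to the question (this is my reading, not something stated outright above): a row of df2 has three slots (A, B, C), and a two-element list does not fit into them, which matches the (8, 2) versus (8, 3) in the error message. Here is a small sketch that catches the error instead of crashing. The names row_like and fitted are purely illustrative.

```python
import pandas as pd

# Three slots, standing in for a three-column row.
row_like = pd.Series([1, 2, 3])

try:
    # Two values cannot fill three slots.
    row_like.loc[:] = [7, 8]
    fitted = True
except ValueError as exc:
    fitted = False
    print("two values into three slots failed:", exc)

# A list of matching length fits without complaint.
row_like.loc[:] = [7, 8, 9]
print(row_like.tolist())
```

A pd.Series return value sidesteps this because it brings its own index rather than being squeezed into the row's existing one.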
