Djiggy

Reputation: 235

Shape of passed values error when using apply on pandas dataframe

I am trying to apply a basic spline function to every row of a given dataframe (dfTest, which contains the values at the points in vector x) to obtain a bigger one (dfBigger) that would contain the values at all points in vector xnew (which includes the points of x).

I therefore define the following variables:

import pandas as pd
import numpy as np

x = [0,1,3,5]
xnew = range(0,6)

np.random.seed(123)
dfTest = pd.DataFrame(np.random.rand(12).reshape(3,4))

and the basic spline function:

def spline(y, x, xnew):
    from scipy import interpolate
    model = interpolate.splrep(x, y, s=0.)
    ynew = interpolate.splev(xnew, model)
    result = ynew.round(3)
    return result

which seems to work:

spline(dfTest.iloc[0],x,xnew)
Out[176]: array([ 0.696,  0.286,  0.161,  0.227,  0.388,  0.551])

but when I try to apply it to all rows using:

dfBigger = dfTest.apply(lambda row : spline(row, x, xnew), axis = 1)

I get this:

ValueError: Shape of passed values is (3, 6), indices imply (3, 4)

Since the size of dfBigger is not defined anywhere, I cannot see what is wrong. Any help or comments on this code would be appreciated.

Upvotes: 1

Views: 2187

Answers (1)

unutbu

Reputation: 879501

df.apply(func) tries to build a new Series or DataFrame out of the values returned by func. The shape of the Series or DataFrame depends on the kind of value returned by func. To get a better handle on how df.apply behaves, experiment with the following calls:

dfTest.apply(lambda row: 1, axis=1)                       # Series
dfTest.apply(lambda row: [1], axis=1)                     # Series
dfTest.apply(lambda row: [1,2], axis=1)                   # Series
dfTest.apply(lambda row: [1,2,3], axis=1)                 # Series
dfTest.apply(lambda row: [1,2,3,4], axis=1)               # Series
dfTest.apply(lambda row: [1,2,3,4,5], axis=1)             # Series

dfTest.apply(lambda row: np.array([1]), axis=1)           # DataFrame
dfTest.apply(lambda row: np.array([1,2]), axis=1)         # ValueError
dfTest.apply(lambda row: np.array([1,2,3]), axis=1)       # ValueError
dfTest.apply(lambda row: np.array([1,2,3,4]), axis=1)     # DataFrame!
dfTest.apply(lambda row: np.array([1,2,3,4,5]), axis=1)   # ValueError

dfTest.apply(lambda row: pd.Series([1]), axis=1)          # DataFrame
dfTest.apply(lambda row: pd.Series([1,2]), axis=1)        # DataFrame
dfTest.apply(lambda row: pd.Series([1,2,3]), axis=1)      # DataFrame
dfTest.apply(lambda row: pd.Series([1,2,3,4]), axis=1)    # DataFrame
dfTest.apply(lambda row: pd.Series([1,2,3,4,5]), axis=1)  # DataFrame

So what rules can we draw from these experiments?

  • If func returns a scalar or a list, df.apply(func) returns a Series.
  • If func returns a Series, df.apply(func) returns a DataFrame.
  • If func returns a 1D NumPy array, and the array has only one element, df.apply(func) returns a DataFrame. (not a terribly useful case...)
  • If func returns a 1D NumPy array, and the array has the same number of elements as df has columns, df.apply(func) returns a DataFrame. (useful, but limited)
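
The outcomes of these experiments can vary between pandas releases, so a quick check along the following lines may help confirm the first two rules on your installed version. This is only a sketch: the helper name kind and the small frame df are hypothetical and not part of the original experiments.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with 3 rows and 4 columns, like dfTest
df = pd.DataFrame(np.arange(12.0).reshape(3, 4))

def kind(func):
    """Hypothetical helper: name of the type df.apply returns for func."""
    return type(df.apply(func, axis=1)).__name__

scalar_kind = kind(lambda row: 1)                           # func returns a scalar
list_kind = kind(lambda row: [1, 2, 3, 4, 5])               # func returns a list
series_kind = kind(lambda row: pd.Series([1, 2, 3, 4, 5]))  # func returns a Series

print(scalar_kind, list_kind, series_kind)
```

Running the same helper on the NumPy array cases shows how your version treats them, which is where the behaviour is least uniform.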

Since func returns 6 values, and you want a DataFrame as the result, the solution is to have func return a Series instead of a NumPy array:

def spline(y, x, xnew):
    ...
    return pd.Series(result)

The complete script:

import numpy as np
import pandas as pd
from scipy import interpolate

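def spline(y, x, xnew):
    # Fit an interpolating spline (s=0 forces it through every data point)
    model = interpolate.splrep(x, y, s=0.)
    # Evaluate the fitted spline at the new x positions
    ynew = interpolate.splev(xnew, model)
    result = ynew.round(3)
    # Return a Series so that df.apply assembles the rows into a DataFrame
    return pd.Series(result)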

x = [0,1,3,5]
xnew = range(0,6)
np.random.seed(123)
dfTest = pd.DataFrame(np.random.rand(12).reshape(3,4))
# spline(dfTest.iloc[0],x,xnew)
dfBigger = dfTest.apply(lambda row : spline(row, x, xnew), axis=1)
print(dfBigger)

yields

        0      1      2      3      4      5
 0  0.696  0.286  0.161  0.227  0.388  0.551
 1  0.719  0.423  0.630  0.981  1.119  0.685
 2  0.481  0.392  0.333  0.343  0.462  0.729
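
As an alternative sketch, assuming pandas 0.23 or later, DataFrame.apply accepts result_type='expand', which should spread the array returned for each row over columns without wrapping it in a Series. The function name spline_array is hypothetical, and this variant is not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import interpolate

def spline_array(y, x, xnew):
    # Same spline as above, but returning a plain NumPy array
    model = interpolate.splrep(x, y, s=0.)
    return interpolate.splev(xnew, model).round(3)

x = [0, 1, 3, 5]
xnew = np.arange(6)
np.random.seed(123)
dfTest = pd.DataFrame(np.random.rand(12).reshape(3, 4))

# result_type='expand' asks apply to turn each list-like result into columns
dfBigger = dfTest.apply(lambda row: spline_array(row, x, xnew),
                        axis=1, result_type='expand')
print(dfBigger.shape)
```

Returning a Series, as in the answer, has the advantage of also working on releases that predate the result_type parameter.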

Upvotes: 7
