clstaudt

Reputation: 22448

pandas: error on DataFrame.unstack

I wrote the following function to convert several columns of a dataframe into numeric values:

def factorizeMany(data, columns):
    """ Factorize a bunch of columns in a data frame"""
    data[columns] = data[columns].stack().rank(method='dense').unstack()
    return data

Calling it like this

trainDataPre = factorizeMany(trainDataMerged.fillna(0), columns=["char_{0}".format(i) for i in range(1,10)])

gives me an error. I don't know where to look for the cause. Is the input possibly wrong?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-357f8a4b2ef8> in <module>()
      1 #trainDataPre = trainDataMerged.drop(["people_id", "activity_id", "date"], axis=1)
      2 #trainDataPre = trainDataMerged.fillna(0)
----> 3 trainDataPre = mininggear.factorizeMany(trainDataMerged.fillna(0), columns=["char_{0}".format(i) for i in range(1,10)])

/Users/cls/Dropbox/Datengräber/Kaggle/RedHat/mininggear.py in factorizeMany(data, columns)
     15 def factorizeMany(data, columns):
     16     """ Factorize a bunch of columns in a data frame"""
---> 17     data[columns] = data[columns].stack().rank(method='dense').unstack()
     18     return data
     19 

/usr/local/lib/python3.5/site-packages/pandas/core/series.py in unstack(self, level, fill_value)
   2041         """
   2042         from pandas.core.reshape import unstack
-> 2043         return unstack(self, level, fill_value)
   2044 
   2045     # ----------------------------------------------------------------------

/usr/local/lib/python3.5/site-packages/pandas/core/reshape.py in unstack(obj, level, fill_value)
    405     else:
    406         unstacker = _Unstacker(obj.values, obj.index, level=level,
--> 407                                fill_value=fill_value)
    408         return unstacker.get_result()
    409 

/usr/local/lib/python3.5/site-packages/pandas/core/reshape.py in __init__(self, values, index, level, value_columns, fill_value)
     90 
     91         # when index includes `nan`, need to lift levels/strides by 1
---> 92         self.lift = 1 if -1 in self.index.labels[self.level] else 0
     93 
     94         self.new_index_levels = list(index.levels)

AttributeError: 'Index' object has no attribute 'labels'

Upvotes: 2

Views: 1354

Answers (1)

Nickil Maveli

Reputation: 29721

The error arises because, after filling the NaNs in the dataframe with 0, you are calling rank on a subset of the dataframe that mixes numerical and categorical/string values.

Consider this case:

import numpy as np
import pandas as pd

df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', np.nan, 'cat', np.nan]})
df


df = df.fillna(0)
df[['char_3']].stack().rank()
Series([], dtype: float64)

So you are effectively calling unstack on an empty Series, which is not what you intended.
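To connect this with the traceback: unstack needs a Series whose index is a MultiIndex (which is what stack normally produces), whereas an empty Series carries a plain, single-level index. The following is a minimal sketch of a check for that condition. The helper name `has_multiindex` and the hand-built `stacked_like` Series are illustrative assumptions, not part of the original code.

```python
import pandas as pd


def has_multiindex(series):
    """Return True if the Series has the multi-level index that unstack needs."""
    return isinstance(series.index, pd.MultiIndex)


# An empty Series, like the one the rank call produced above
empty = pd.Series([], dtype="float64")

# A hand-built stand-in for a stacked frame: (row label, column name) pairs
idx = pd.MultiIndex.from_product([[0, 1], ["char_1", "char_2"]])
stacked_like = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

print(has_multiindex(empty))         # plain index, nothing to unstack
print(has_multiindex(stacked_like))  # two levels, safe to unstack
print(stacked_like.unstack())
```

Running such a check before unstack makes it easier to see that the problem lies in the intermediate result rather than in unstack itself.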

A better approach that avoids these complications:

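def factorizeMany(data, columns):
    """ Factorize a bunch of columns in a data frame"""
    # dropna=False keeps NaN entries, so every cell survives the round trip
    stacked = data[columns].stack(dropna=False)
    # factorize() returns (codes, uniques); NaN is coded as -1
    data[columns] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
    return data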
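As a hedged alternative sketch, the same shared coding across columns can be obtained without the stack/unstack round trip by flattening the selected block and calling `pd.factorize` once. The function name `factorize_shared` and the choice to return a copy together with the unique values are assumptions made for illustration. It is a swapped-in variant, not the function above.

```python
import numpy as np
import pandas as pd


def factorize_shared(data, columns):
    """Encode several columns with one shared set of integer codes.

    Returns a copy of `data` with the columns replaced by codes, plus the
    array of unique values (position i corresponds to code i).
    """
    block = data[columns]
    # Flatten row by row; pd.factorize assigns -1 to missing values
    codes, uniques = pd.factorize(block.to_numpy().ravel())
    out = data.copy()
    out[columns] = codes.reshape(block.shape)
    return out, uniques


df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', np.nan, 'cat', np.nan]})

encoded, uniques = factorize_shared(df, ['char_1', 'char_2', 'char_3'])
print(encoded)
print(list(uniques))
```

Because the codes come from a single factorize call, 'cat' receives the same code in char_1 and char_3, and missing values stay recognizable as -1 instead of being merged with a filled-in 0.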

Upvotes: 1
