Reputation: 22448
I wrote the following function to convert several columns of a dataframe into numeric values:
def factorizeMany(data, columns):
    """ Factorize a bunch of columns in a data frame"""
    data[columns] = data[columns].stack().rank(method='dense').unstack()
    return data
Calling it like this
trainDataPre = factorizeMany(trainDataMerged.fillna(0), columns=["char_{0}".format(i) for i in range(1,10)])
raises the AttributeError below. I don't know where to look for the cause. Could the input be wrong?
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-14-357f8a4b2ef8> in <module>()
1 #trainDataPre = trainDataMerged.drop(["people_id", "activity_id", "date"], axis=1)
2 #trainDataPre = trainDataMerged.fillna(0)
----> 3 trainDataPre = mininggear.factorizeMany(trainDataMerged.fillna(0), columns=["char_{0}".format(i) for i in range(1,10)])
/Users/cls/Dropbox/Datengräber/Kaggle/RedHat/mininggear.py in factorizeMany(data, columns)
15 def factorizeMany(data, columns):
16 """ Factorize a bunch of columns in a data frame"""
---> 17 data[columns] = data[columns].stack().rank(method='dense').unstack()
18 return data
19
/usr/local/lib/python3.5/site-packages/pandas/core/series.py in unstack(self, level, fill_value)
2041 """
2042 from pandas.core.reshape import unstack
-> 2043 return unstack(self, level, fill_value)
2044
2045 # ----------------------------------------------------------------------
/usr/local/lib/python3.5/site-packages/pandas/core/reshape.py in unstack(obj, level, fill_value)
405 else:
406 unstacker = _Unstacker(obj.values, obj.index, level=level,
--> 407 fill_value=fill_value)
408 return unstacker.get_result()
409
/usr/local/lib/python3.5/site-packages/pandas/core/reshape.py in __init__(self, values, index, level, value_columns, fill_value)
90
91 # when index includes `nan`, need to lift levels/strides by 1
---> 92 self.lift = 1 if -1 in self.index.labels[self.level] else 0
93
94 self.new_index_levels = list(index.levels)
AttributeError: 'Index' object has no attribute 'labels'
Upvotes: 2
Views: 1354
Reputation: 29721
The error occurs because, after filling the NaNs in the dataframe with 0, you call the function on a subset of the dataframe that mixes numerical and categorical/string values, and the rank operation cannot handle that mix.
Consider this case:
import numpy as np
import pandas as pd

df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', np.nan, 'cat', np.nan]})
df
df = df.fillna(0)
df[['char_3']].stack().rank()
Series([], dtype: float64)
So you are effectively calling unstack on an empty series, which is not what you intended.
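To check whether your own columns are affected, a quick diagnostic like the one below may help. It is only a sketch: the frame is the toy example from above, built with dtype=object as an assumption so that it behaves the same across pandas versions, and the names types_per_column and mixed are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame from the example above; dtype=object is an assumption made here
# so the columns hold plain Python objects regardless of pandas version.
df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', np.nan, 'cat', np.nan]},
                  dtype=object)

filled = df.fillna(0)

# Collect the Python type names that occur in each column after fillna(0).
types_per_column = {
    col: sorted({type(value).__name__ for value in filled[col]})
    for col in filled.columns
}

# Columns holding more than one type are the ones that mix strings and numbers.
mixed = [col for col, names in types_per_column.items() if len(names) > 1]
print(mixed)
```

Any column listed in mixed contains both strings and the 0 that fillna inserted, which is the situation described above.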
To avoid further complications, it is better to factorize the stacked values directly:
import pandas

def factorizeMany(data, columns):
    """ Factorize a bunch of columns in a data frame"""
    # Stack all columns into one series so they share a single set of codes
    stacked = data[columns].stack(dropna=False)
    data[columns] = pandas.Series(stacked.factorize()[0], index=stacked.index).unstack()
    return data
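If stack and unstack give you trouble on your pandas version, the same idea can probably be expressed without them by flattening the selected columns, factorizing once, and reshaping. The following is a minimal sketch under that assumption, not a drop-in replacement: factorize_many_flat is a hypothetical name, it returns a copy instead of modifying data in place, and it also hands back the unique values.

```python
import numpy as np
import pandas as pd

def factorize_many_flat(data, columns):
    """Return a copy of data whose `columns` share one set of integer codes.

    Hypothetical helper for illustration; missing values get the code -1.
    """
    out = data.copy()
    # Flatten the selected columns row by row into a 1-D object array.
    values = out[columns].to_numpy(dtype=object).ravel()
    # One factorize call means the same value gets the same code in every column.
    codes, uniques = pd.factorize(values)
    out[columns] = codes.reshape(len(out), len(columns))
    return out, uniques

df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', np.nan, 'cat', np.nan]})

encoded, uniques = factorize_many_flat(df, ['char_1', 'char_2', 'char_3'])
print(encoded)
print(list(uniques))
```

Note that 'cat' receives the same code in char_1 and char_3 because the codes are shared across columns, and that no fillna(0) is needed since pd.factorize marks missing values with -1.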
Upvotes: 1