Niccola Tartaglia

Reputation: 1667

Convert pandas Dataframe to numeric

My dataframe appears to be non-numeric after some transformations (see the previous post on dropping duplicates: drop duplicates pandas dataframe).

When I use it in a statsmodels regression I get this error:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

Can I convert the entire dataframe back to numeric somehow?

Using the dataframe with sklearn works, for some reason.

I am actually not sure what the data type is. I only noticed that the dataframe is no longer colored when I opened it in Spyder. type(df) just tells me that it is a dataframe.

This is an example from the post I mentioned where the transformation occurs (compare the df before and after the last line):

    import pandas as pd

    dict1 = [{'var0': 0, 'var1': 0, 'var2': 2},
             {'var0': 0, 'var1': 0, 'var2': 4},
             {'var0': 0, 'var1': 0, 'var2': 8},
             {'var0': 0, 'var1': 0, 'var2': 12}]

    df = pd.DataFrame(dict1, index=['s1', 's2', 's1', 's2'])

    # Transpose, drop duplicate rows (the duplicate columns of df), transpose back
    df = df.reset_index().T.drop_duplicates().T.set_index('index')

This is the dataframe before running the last line:

 df.info()
 <class 'pandas.core.frame.DataFrame'>
 Index: 4 entries, s1 to s2
 Data columns (total 3 columns):
 var0    4 non-null int64
 var1    4 non-null int64
 var2    4 non-null int64
 dtypes: int64(3)

And this is after:

  df.info()
  <class 'pandas.core.frame.DataFrame'>
  Index: 4 entries, s1 to s2 
  Data columns (total 2 columns):
  var0    4 non-null object
  var2    4 non-null object
  dtypes: object(2)
  memory usage: 96.0+ bytes

After the transformation:

    print(df)
          var0 var2
    index
    s1       0    2
    s2       0    4
    s1       0    8
    s2       0   12

Upvotes: 2

Views: 6283

Answers (1)

Haleemur Ali

Reputation: 28253

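One issue with the original answer in this post is that the transformation converts the integers to objects. reset_index() turns the textual index into an ordinary column, so after the transpose every column stores a string label alongside the integers. pandas can only hold such mixed columns with the object dtype, and transposing back does not restore int64.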
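To see where the dtype is lost, here is a minimal sketch that rebuilds the example dataframe from the question and inspects the dtypes at each step. The intermediate names (with_index_col, transposed, round_trip) are only illustrative, and the drop_duplicates() step is left out because it does not affect the dtypes.

```python
import pandas as pd

df = pd.DataFrame(
    {"var0": [0, 0, 0, 0], "var1": [0, 0, 0, 0], "var2": [2, 4, 8, 12]},
    index=["s1", "s2", "s1", "s2"],
)

# reset_index() turns the string labels into an ordinary column named "index".
with_index_col = df.reset_index()

# After the transpose, each column holds one string label plus three integers,
# so pandas falls back to the object dtype for every column.
transposed = with_index_col.T

# Transposing back keeps the object dtype; pandas does not re-infer int64.
round_trip = transposed.T.set_index("index")

print(df.dtypes.tolist())          # integer dtypes
print(round_trip.dtypes.tolist())  # object dtypes
```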

Instead, you can sidestep the issue by dropping the index before the transpose and reattaching it afterwards:

out = df.reset_index(drop=True).T.drop_duplicates().T.set_index(df.index)
out
    var0  var2
s1     0     2
s2     0     4
s1     0     8
s2     0    12

Or, if your actual data is different enough that you can't use the above, you can always cast. Note that astype returns a new dataframe, so assign the result:

out = out.astype(int)
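As a rough sketch of the casting route, the snippet below compares three ways pandas offers to turn object columns back into numeric ones. obj_df is a hypothetical stand-in for the object-dtype result from the question. Which option fits best depends on your data: astype(int) assumes every value is an integer, while pd.to_numeric and infer_objects() let pandas choose the dtype.

```python
import pandas as pd

# Stand-in for the object-dtype result of the transpose round trip.
obj_df = pd.DataFrame(
    {"var0": [0, 0, 0, 0], "var2": [2, 4, 8, 12]},
    index=["s1", "s2", "s1", "s2"],
    dtype=object,
)

# Explicit cast: raises if any value cannot be converted to an integer.
as_int = obj_df.astype(int)

# Column-wise conversion: lets pandas pick a numeric dtype per column.
as_numeric = obj_df.apply(pd.to_numeric)

# Soft conversion: re-infers a better dtype for object columns.
inferred = obj_df.infer_objects()

print(as_int.dtypes)
```

All three return a new dataframe and leave obj_df unchanged.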

Upvotes: 3
