Socob

Reputation: 1284

Preserving dtypes when extracting a row from a pandas DataFrame

Extracting a single row from a pandas DataFrame (e.g. using .loc or .iloc) yields a pandas Series. However, when dealing with heterogeneous data in the DataFrame (i.e. the DataFrame’s columns are not all the same dtype), this causes all the values from the different columns in the row to be coerced into a single dtype, because a Series can only have one dtype. Here is a simple example to show what I mean:

import numpy
import pandas

a = numpy.arange(5, dtype='i8')
b = numpy.arange(5, dtype='u8')**2
c = numpy.arange(5, dtype='f8')**3
df = pandas.DataFrame({'a': a, 'b': b, 'c': c})
df.dtypes
# a      int64
# b     uint64
# c    float64
# dtype: object
df
#    a   b     c
# 0  0   0   0.0
# 1  1   1   1.0
# 2  2   4   8.0
# 3  3   9  27.0
# 4  4  16  64.0
df.loc[2]
# a    2.0
# b    4.0
# c    8.0
# Name: 2, dtype: float64

All values in df.loc[2] have been converted to float64.

Is there a good way to extract a row without incurring this type conversion? I could imagine e.g. returning a numpy structured array, but I don’t see a hassle-free way of creating such an array.
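
For illustration, the closest I can get is to build such a structured array by hand, which works but is exactly the kind of hassle I’d like to avoid (a rough sketch, output abbreviated):

# Hand-built structured array for row 2: works, but feels clunky
row = numpy.array(
    [tuple(df[col].iloc[2] for col in df.columns)],
    dtype=[(col, df[col].dtype) for col in df.columns],
)[0]
row.dtype
# dtype([('a', '<i8'), ('b', '<u8'), ('c', '<f8')])
type(row['b'])
# <class 'numpy.uint64'>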

Upvotes: 1

Views: 1834

Answers (3)

banjaxed

Reputation: 243

As described in the official documentation, pass a list of labels to .loc (i.e. df.loc[[2]] instead of df.loc[2]) to return a DataFrame instead of a Series. This preserves the dtypes of the columns. Using your original example:

>>> import numpy
>>> import pandas
>>> a = numpy.arange(5, dtype='i8')
>>> b = numpy.arange(5, dtype='u8')**2
>>> c = numpy.arange(5, dtype='f8')**3
>>> df = pandas.DataFrame({'a': a, 'b': b, 'c': c})
>>> df.dtypes
a      int64
b     uint64
c    float64
dtype: object

>>> df
   a   b     c
0  0   0   0.0
1  1   1   1.0
2  2   4   8.0
3  3   9  27.0
4  4  16  64.0

>>> df.loc[[2]]
   a  b    c
2  2  4  8.0

>>> df.loc[[2]].dtypes
a      int64
b     uint64
c    float64
dtype: object

>>> df.loc[[2]].iloc[0].name 
2
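
Individual values can then be read out of the one-row DataFrame without ever going through a mixed-dtype Series, for example with .at (a quick sketch continuing the session above):

>>> row = df.loc[[2]]
>>> type(row.at[2, 'a'])
<class 'numpy.int64'>
>>> type(row.at[2, 'b'])
<class 'numpy.uint64'>
>>> type(row.at[2, 'c'])
<class 'numpy.float64'>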

Upvotes: 0

sandyscott

Reputation: 248

Another approach (but it feels slightly hacky):

Instead of using an integer with loc or iloc, you can use a slice of length 1. This returns a DataFrame of length 1, so the column dtypes are preserved and .iloc[0] on each column gives you the individual values. E.g.:

In[1] : row2 = df[2:2+1]
In[2] : type(row2)
Out[2]: pandas.core.frame.DataFrame
In[3] : row2.dtypes
Out[3]: 
a      int64
b     uint64
c    float64
In[4] : a2 = row2.a.iloc[0]
In[5] : type(a2)
Out[5]: numpy.int64
In[6] : c2 = row2.c.iloc[0]
In[7] : type(c2)
Out[7]: numpy.float64

To me this feels preferable to converting the data types twice (once during row extraction, and again afterwards), and clearer than referring to the original DataFrame multiple times with the same row specification (which could be computationally expensive).

I think it would be better if pandas had a DataFrameRow type for this situation.
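
If you end up doing this in more than one place, a tiny helper keeps the intent clear (row_at is just an illustrative name; using .iloc makes the slice purely positional, so it also works with a non-default index):

def row_at(frame, i):
    # Positional slice of length 1: a one-row DataFrame,
    # so every column keeps its own dtype.
    return frame.iloc[i:i + 1]

row2 = row_at(df, 2)
row2.dtypes
# a      int64
# b     uint64
# c    float64
# dtype: object
type(row2['b'].iloc[0])
# <class 'numpy.uint64'>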

Upvotes: 4

Andy L.

Reputation: 25269

As you already realized, a Series doesn't allow mixed dtypes. However, it does allow mixed data types if its dtype is object. So you can convert the DataFrame's dtypes to object: every column will then have dtype object, but each value still keeps its data type of int or float.

In [9]: df1 = df.astype('O')

In [10]: df1
Out[10]:
   a   b   c
0  0   0   0
1  1   1   1
2  2   4   8
3  3   9  27
4  4  16  64

In [12]: df1.loc[2].map(type)
Out[12]:
a      <class 'int'>
b      <class 'int'>
c    <class 'float'>
Name: 2, dtype: object

Otherwise, you can convert the DataFrame to a numpy record array (np.recarray) with to_records:

In [21]: n_recs = df.to_records(index=False)

In [22]: n_recs
Out[22]:
rec.array([(0,  0,  0.), (1,  1,  1.), (2,  4,  8.), (3,  9, 27.),
           (4, 16, 64.)],
          dtype=[('a', '<i8'), ('b', '<u8'), ('c', '<f8')])
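
Indexing a single element of the record array then gives you the row you are after, with each field keeping its own dtype (a quick sketch continuing the session above):

In [23]: row = n_recs[2]

In [24]: type(row['a']), type(row['b']), type(row['c'])
Out[24]: (numpy.int64, numpy.uint64, numpy.float64)

Attribute access (row.b) also works here, since to_records returns a recarray.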

Upvotes: 2
