Reputation: 1284
Extracting a single row from a pandas DataFrame (e.g. using .loc or .iloc) yields a pandas Series. However, when dealing with heterogeneous data in the DataFrame (i.e. the DataFrame’s columns are not all the same dtype), all the values from the different columns in the row are coerced into a single dtype, because a Series can only have one dtype. Here is a simple example to show what I mean:
import numpy
import pandas
a = numpy.arange(5, dtype='i8')
b = numpy.arange(5, dtype='u8')**2
c = numpy.arange(5, dtype='f8')**3
df = pandas.DataFrame({'a': a, 'b': b, 'c': c})
df.dtypes
# a int64
# b uint64
# c float64
# dtype: object
df
# a b c
# 0 0 0 0.0
# 1 1 1 1.0
# 2 2 4 8.0
# 3 3 9 27.0
# 4 4 16 64.0
df.loc[2]
# a 2.0
# b 4.0
# c 8.0
# Name: 2, dtype: float64
All values in df.loc[2] have been converted to float64.
Is there a good way to extract a row without incurring this type conversion? I could imagine e.g. returning a numpy structured array, but I don’t see a hassle-free way of creating such an array.
Upvotes: 1
Views: 1834
Reputation: 243
As described in the official documentation, passing a list of labels to .loc (i.e. df.loc[[2]]) returns a DataFrame instead of a Series, which preserves the dtypes of the columns. Using your original example:
>>> import numpy
>>> import pandas
>>> a = numpy.arange(5, dtype='i8')
>>> b = numpy.arange(5, dtype='u8')**2
>>> c = numpy.arange(5, dtype='f8')**3
>>> df = pandas.DataFrame({'a': a, 'b': b, 'c': c})
>>> df.dtypes
a int64
b uint64
c float64
dtype: object
>>> df
a b c
0 0 0 0.0
1 1 1 1.0
2 2 4 8.0
3 3 9 27.0
4 4 16 64.0
>>> df.loc[[2]]
a b c
2 2 4 8.0
>>> df.loc[[2]].dtypes
a int64
b uint64
c float64
dtype: object
>>> df.loc[[2]].iloc[0].name
2
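Once you have the one-row DataFrame, you can pull out individual values with .at, which does a fast scalar lookup by label and returns each value with its column's dtype. A minimal sketch, rebuilding the question's df:

```python
import numpy
import pandas

# Rebuild the question's example DataFrame
a = numpy.arange(5, dtype='i8')
b = numpy.arange(5, dtype='u8')**2
c = numpy.arange(5, dtype='f8')**3
df = pandas.DataFrame({'a': a, 'b': b, 'c': c})

row = df.loc[[2]]        # one-row DataFrame; per-column dtypes survive
a_val = row.at[2, 'a']   # scalar lookup by row label and column name
b_val = row.at[2, 'b']
c_val = row.at[2, 'c']
```

Each scalar comes back as the numpy type of its own column (int64, uint64, float64), rather than everything being coerced to float64.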
Upvotes: 0
Reputation: 248
Another approach (but it feels slightly hacky): instead of using an integer with loc or iloc, you can use a slice of length 1. This returns a DataFrame of length 1, so iloc[0] contains your data, e.g.
In[1] : row2 = df[2:2+1]
In[2] : type(row2)
Out[2]: pandas.core.frame.DataFrame
In[3] : row2.dtypes
Out[3]:
a int64
b uint64
c float64
In[4] : a2 = row2.a.iloc[0]
In[5] : type(a2)
Out[5]: numpy.int64
In[6] : c2 = row2.c.iloc[0]
In[7] : type(c2)
Out[7]: numpy.float64
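The same trick can be written explicitly positionally: iloc also accepts a slice and likewise returns a length-1 DataFrame. A short sketch, assuming the df from the question:

```python
import numpy
import pandas

df = pandas.DataFrame({
    'a': numpy.arange(5, dtype='i8'),
    'b': numpy.arange(5, dtype='u8')**2,
    'c': numpy.arange(5, dtype='f8')**3,
})

row2 = df.iloc[2:3]        # positional slice of length 1 -> DataFrame
b2 = row2['b'].iloc[0]     # individual values keep their column dtype
```

Using iloc makes the intent clearer when the index is not the default RangeIndex, since df[2:2+1] would then still slice by position while df.loc[2:3] would slice by label.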
To me this feels preferable to converting the data types twice (once during row extraction, and again afterwards), and clearer than referring to the original DataFrame multiple times with the same row specification (which could be computationally expensive).
I think it would be better if pandas had a DataFrameRow type for this situation.
Upvotes: 4
Reputation: 25269
As you already realized, a Series doesn't allow mixed dtypes. However, it does allow mixed data types if its dtype is object. So you can convert the dtypes of the DataFrame to object: every column will then have dtype object, but each value still keeps its data type of int or float.
In [10]: df1 = df.astype('O'); df1
Out[10]:
a b c
0 0 0 0
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
In [12]: df1.loc[2].map(type)
Out[12]:
a <class 'int'>
b <class 'int'>
c <class 'float'>
Name: 2, dtype: object
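As a usage sketch of the object-dtype approach: a row extracted from the converted frame with .loc is still a Series, but because its dtype is object, each value keeps its own Python type instead of being coerced to float64:

```python
import numpy
import pandas

df = pandas.DataFrame({
    'a': numpy.arange(5, dtype='i8'),
    'b': numpy.arange(5, dtype='u8')**2,
    'c': numpy.arange(5, dtype='f8')**3,
})

df1 = df.astype('O')
row = df1.loc[2]                 # object-dtype Series; no float64 coercion
types = row.map(type).tolist()   # per-value types survive extraction
```

The trade-off is that object columns give up vectorized numeric performance, so this is best done only for the extraction step rather than for the whole workflow.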
Otherwise, you can convert the DataFrame to a numpy record array:
In [22]: n_recs = df.to_records(index=False); n_recs
Out[22]:
rec.array([(0, 0, 0.), (1, 1, 1.), (2, 4, 8.), (3, 9, 27.),
(4, 16, 64.)],
dtype=[('a', '<i8'), ('b', '<u8'), ('c', '<f8')])
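Indexing a single row of the record array then gives a numpy record whose fields each keep their own dtype, which is essentially the structured-array behaviour the question asked about. A sketch using the same df:

```python
import numpy
import pandas

df = pandas.DataFrame({
    'a': numpy.arange(5, dtype='i8'),
    'b': numpy.arange(5, dtype='u8')**2,
    'c': numpy.arange(5, dtype='f8')**3,
})

n_recs = df.to_records(index=False)
rec = n_recs[2]                    # one heterogeneous row as a numpy record
a2, b2, c2 = rec['a'], rec['b'], rec['c']   # fields accessed by column name
```
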
Upvotes: 2