Reputation: 315
I would like to control the output dtypes for apply on a row. foo and bar below have multiple outputs.
import pandas as pd
def foo(x):
    return x['a'] * x['b'], None, x['a'] > x['b']
def bar(x):
    return x['a'] * x['b'], None
df = pd.DataFrame([{'a': 10, 'b': 2}, {'a': 10, 'b': 20}])
df2 = df.copy()
df[['product', 'dummy', 'greater']] = df.apply(foo, axis=1, result_type='expand')
df2[['product', 'dummy']] = df2.apply(bar, axis=1, result_type='expand')
The output dtypes are:
| col | df | df2 |
|---|---|---|
| a | int64 | int64 |
| b | int64 | int64 |
| product | int64 | float64 |
| dummy | object | float64 |
| greater | bool | - |
A comment on the question "pandas apply changing dtype" suggests that apply returns a Series with a single dtype. That may be the case with bar, since the outputs can be cast to float. But it doesn't seem to be the case for foo, because then the outputs would need to be object.
Is it possible to control the output dtypes of apply? I.e. get/specify the output dtypes (int, object) for bar, or do I need to cast the dtype at the end?
Background: I have a dataframe where the dummy column has values True, False and None and dtype 'object'. The apply function runs on some corner cases, and introduces NaN instead of None. I'm replacing the NaN with None after apply, but it seems overly complicated.
pandas version 1.5.2
Upvotes: 0
Views: 84
Reputation: 37737
IIUC, you're asking why product and dummy have different dtypes after applying foo and bar, even though the values returned by those functions are the same for those new columns?
col df df2
0 a int64 int64
1 b int64 int64
2 product int64 float64 # int64 <> float64
3 dummy object float64 # object <> float64
4 greater bool
If so, that's because when result_type == "expand", there is a specific transformation done behind the scenes with infer_to_same_shape, which is roughly equivalent to this:
_datafoo = {0: (20, None, True), 1: (200, None, False)}
_databar = {0: (20, None), 1: (200, None)}
expandfoo = pd.DataFrame(_datafoo).T.set_axis(df.index).infer_objects()
expandbar = pd.DataFrame(_databar).T.set_axis(df.index).infer_objects()
Output (foo) :
print(expandfoo.T, expandfoo, expandfoo.dtypes.to_dict(), sep="\n"*2)
0 1
0 20 200
1 None None
2 True False
0 1 2
0 20 None True
1 200 None False
{0: dtype('int64'), 1: dtype('O'), 2: dtype('bool')}
Output (bar) :
print(expandbar.T, expandbar, expandbar.dtypes.to_dict(), sep="\n"*2)
0 1
0 20.0 200.0
1 NaN NaN # <-- see the presence of NaN
0 1
0 20.0 NaN
1 200.0 NaN
{0: dtype('float64'), 1: dtype('float64')}
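As you can see, expandbar ends up as float64 for both columns: None is already converted to NaN when the frame is built from the tuples, and infer_objects leaves float columns unchanged (if this is unintuitive, see GH28318).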
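The same conversion can be reproduced without apply. This is a minimal sketch, assuming each row's return values are materialized as one column before the transpose, as in the snippet above. The names bar_like and foo_like are made up for illustration, and lists are used in place of the tuples:

```python
import pandas as pd

# One row's return values, as they would look as a single column.
bar_like = pd.Series([20, None])        # int + None
foo_like = pd.Series([20, None, True])  # int + None + bool

print(bar_like.dtype)  # float64: None is converted to NaN
print(foo_like.dtype)  # object: the mixed values are kept as-is, None included
```

With only numbers and None, pandas picks a float column and NaN. The extra bool in foo forces object, which is why None survives there.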
Is it possible to control the output dtypes of apply?
That depends on the computation made by the applied function and the values it returns, so you only have indirect control. You can always add convert_dtypes or astype at the end.
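For the bar case from the question, casting at the end could look like the sketch below. The target dtypes (int64 for product, object holding None for dummy) come from the question. The list comprehension that restores None is only one possible way to do it, and the name expanded is made up for illustration:

```python
import pandas as pd

def bar(x):
    return x['a'] * x['b'], None

df2 = pd.DataFrame([{'a': 10, 'b': 2}, {'a': 10, 'b': 20}])

# Columns of the expanded result are labelled 0 and 1.
# On pandas 1.5.2 (the version in the question) both come back as float64.
expanded = df2.apply(bar, axis=1, result_type='expand')

# Cast the numeric column back to an integer dtype.
df2['product'] = expanded[0].astype('int64')

# Rebuild the dummy column as object, mapping NaN back to None.
df2['dummy'] = pd.Series(
    [None if pd.isna(v) else v for v in expanded[1]],
    index=expanded.index,
    dtype=object,
)

print(df2.dtypes)
```

Passing dtype=object explicitly keeps pandas from inferring float64 again when the column holds only None.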
Upvotes: 1