Reputation: 8553
I have a use case that I thought would be quite common, so I expected this question to be easy to answer myself, but I couldn't find the answer anywhere. Consider the following.
import numpy
import pandas

df = pandas.DataFrame({"id": numpy.random.choice(range(100), 5, replace=False),
                       "value": numpy.random.rand(5)})
df2 = pandas.DataFrame([df["id"], df["value"]*2]).T
Basically I'm creating a DataFrame, df2, based on the values of an old DataFrame, df. Now if we run
print(df.dtypes, end="\n------\n")
print(df2.dtypes)
we get
id         int64
value    float64
dtype: object
------
id       float64
value    float64
dtype: object
You can see that the dtype of the first column of df2 is float64, instead of int64 as it should be, even though the dtype of the Series itself is int64. This behaviour is very perplexing to me, and I can't believe that it's intentional. How do I create a DataFrame from some Series and preserve the dtypes of the Series? In my mind it should be as easy as pandas.DataFrame([s1, s2], dtypes=[int, float]), but you can't do that in pandas for some reason.
Upvotes: 3
Views: 536
Reputation: 879083
Columns of a DataFrame always have a single dtype. (This is because, under the hood, Pandas stores columns of data which have the same dtype in blocks.)
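When pd.DataFrame is passed a list of Series, it unpacks each Series into a separate row. Since the Series have different dtypes, the columns end up with values of mixed dtypes. Pandas resolves this by upcasting all values in each column to a single dtype. Here, every column of the untransposed frame holds one int64 value and one float64 value, so each column becomes float64, and the .T afterwards can only produce float64 columns.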
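The row-wise unpacking can be seen in a small sketch. The Series names `ids` and `values` and their contents are made up for illustration; they stand in for df["id"] and df["value"]*2 from the question.

```python
import pandas as pd

# Hypothetical stand-ins for df["id"] and df["value"] * 2.
ids = pd.Series([3, 1, 2], name="id", dtype="int64")
values = pd.Series([0.5, 1.5, 2.5], name="value", dtype="float64")

# A list of Series is laid out row by row, so the frame is 2 x 3, not 3 x 2.
rows = pd.DataFrame([ids, values])
print(rows.shape)
print(rows.dtypes)  # each column mixes an int and a float, so it is upcast

# Transposing restores the intended shape, but the int64 dtype is already lost.
transposed = rows.T
print(transposed.dtypes)
```

The upcast happens when the rows are assembled, before .T is ever called, which is why the transpose cannot recover int64.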
You could define df2 with:
df2 = pd.DataFrame({'id': df["id"], 'value': df["value"]*2})
or
df2 = df.copy()
df2['value'] *= 2
or
df2 = pd.concat([df["id"], df["value"]*2], axis=1)
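As a quick check, the following sketch builds df2 in each of the three ways and prints the resulting dtypes. The sample data and the variable names `from_dict`, `from_copy`, and `from_concat` are illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with an int64 column and a float64 column.
df = pd.DataFrame({"id": np.array([7, 42, 13], dtype="int64"),
                   "value": [0.1, 0.2, 0.3]})

# 1. Dict of Series: each Series becomes a column and keeps its own dtype.
from_dict = pd.DataFrame({"id": df["id"], "value": df["value"] * 2})

# 2. Copy the frame, then modify the one column in place.
from_copy = df.copy()
from_copy["value"] *= 2

# 3. Concatenate the Series side by side as columns.
from_concat = pd.concat([df["id"], df["value"] * 2], axis=1)

for name, frame in [("dict", from_dict), ("copy", from_copy), ("concat", from_concat)]:
    print(name, frame.dtypes.astype(str).to_dict())
```

All three keep the Series as columns rather than rows, so no column ever has to hold values of two different dtypes.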
Upvotes: 4