Ray

Reputation: 8553

Python Pandas. Creating DataFrame with Series does not preserve dtype

I have a use case that I thought would be quite common, so I expected the answer to be easy to find, but I couldn't find it anywhere. Consider the following.

import numpy
import pandas

df = pandas.DataFrame({"id": numpy.random.choice(range(100), 5, replace=False),
                       "value": numpy.random.rand(5)})
df2 = pandas.DataFrame([df["id"], df["value"]*2]).T

Basically I'm creating a DataFrame, df2, based on the values of an old DataFrame, df. Now if we run

print(df.dtypes, end="\n------\n")
print(df2.dtypes)

we get

id         int64
value    float64
dtype: object
------
id       float64
value    float64
dtype: object

You can see that the dtype of the first column of df2 is float64 instead of int64, even though the dtype of the Series itself is int64. This behaviour is very perplexing to me, and I can't believe that it's intentional. How do I create a DataFrame from several Series and preserve the dtype of each Series? In my mind it should be as easy as pandas.DataFrame([s1, s2], dtypes=[int, float]), but you can't do that in pandas for some reason.

Upvotes: 3

Views: 536

Answers (1)

unutbu

Reputation: 879083

Each column of a DataFrame always has a single dtype. (This is because, under the hood, pandas stores columns that share a dtype together in blocks.)

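When pd.DataFrame is passed a list of Series, it unpacks each Series into a separate row. Since the Series have different dtypes, each column ends up holding values of mixed dtypes (here, one int64 and one float64). Pandas resolves this by upcasting all values in each column to a single common dtype, which is float64 in this case. Transposing with .T afterwards does not undo the upcast, so both columns of df2 come out as float64.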
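
To make the row-wise unpacking visible, here is a minimal sketch that inspects the intermediate frame before the transpose. The fixed values are hypothetical stand-ins for the random data in the question, and the names rows and df2 are only for illustration.

```python
import pandas as pd

# Hypothetical fixed data standing in for the random values in the question.
df = pd.DataFrame({"id": [3, 1, 4, 5, 9],
                   "value": [0.1, 0.2, 0.3, 0.4, 0.5]})

# A list of Series becomes one ROW per Series, so this frame has 2 rows and 5 columns.
rows = pd.DataFrame([df["id"], df["value"] * 2])
print(rows.shape)
print(rows.dtypes)  # each column mixes an int and a float, so it is upcast

# Transposing restores the original shape, but the upcast dtype stays.
df2 = rows.T
print(df2.dtypes)
```

Printing rows.dtypes should show that the upcast has already happened before .T is applied, which is why the transpose cannot bring int64 back.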


You could define df2 with (assuming pandas has been imported as pd):

df2 = pd.DataFrame({'id': df["id"], 'value': df["value"]*2})

or

df2 = df.copy()
df2['value'] *= 2

or

df2 = pd.concat([df["id"], df["value"]*2], axis=1)
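
As a quick check, the sketch below builds df2 all three ways and prints the resulting dtypes. The fixed data and the names via_dict, via_copy, via_concat, upcast and restored are hypothetical. The last two lines are an addition that is not part of the options above: if a frame has already been upcast, astype with a column-to-dtype mapping can cast a column back, which is roughly what the dtypes=[int, float] idea in the question was after.

```python
import pandas as pd

# Hypothetical fixed data standing in for the random values in the question.
df = pd.DataFrame({"id": [3, 1, 4, 5, 9],
                   "value": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Option 1: dict of Series, each Series becomes a column.
via_dict = pd.DataFrame({"id": df["id"], "value": df["value"] * 2})

# Option 2: copy the frame and modify one column in place.
via_copy = df.copy()
via_copy["value"] *= 2

# Option 3: concatenate the Series side by side as columns.
via_concat = pd.concat([df["id"], df["value"] * 2], axis=1)

for name, frame in [("dict", via_dict), ("copy", via_copy), ("concat", via_concat)]:
    print(name, dict(frame.dtypes))

# Not one of the options above: cast a column back after an upcast has happened.
upcast = pd.DataFrame([df["id"], df["value"] * 2]).T
restored = upcast.astype({"id": "int64"})
print(restored.dtypes)
```

Casting back with astype is a repair rather than a fix: very large integers may lose precision on the round trip through float64, so building the frame column-wise in the first place is the safer route.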

Upvotes: 4
