Chris

Reputation: 13660

How to set dtypes by column in pandas DataFrame

I want to bring some data into a pandas DataFrame and I want to assign dtypes for each column on import. I want to be able to do this for larger datasets with many different columns, but, as an example:

myarray = np.random.randint(0,5,size=(2,2))
mydf = pd.DataFrame(myarray,columns=['a','b'], dtype=[float,int])
mydf.dtypes

results in:

TypeError: data type not understood

I tried a few other methods such as:

mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': int})

TypeError: object of type 'type' has no len()

If I put dtype=(float,int) it applies a float format to both columns.

In the end I would like to just be able to pass it a list of datatypes the same way I can pass it a list of column names.

Upvotes: 59

Views: 58758

Answers (6)

Riley Hales

Reputation: 7

A solution available in more recent versions of pandas (currently 2.X) is to pass DataFrame.astype() a dictionary whose keys are column names and whose values are the dtypes those columns should have.

Other comments and answers said this was not possible in past versions but can be done at least in 2.X versions.

df = pd.DataFrame(
    {'some_ints': [1, 2, 3], 'some_strs': ['a', 'b', 'c']},
    dtype='str'  # the constructor accepts only a single dtype, not a dict
)

df.dtypes.to_dict()

>>> {'some_ints': dtype('O'), 'some_strs': dtype('O')}

df = df.astype({'some_ints': 'int64', 'some_strs': 'str'})

df.dtypes.to_dict()

>>> {'some_ints': dtype('int64'), 'some_strs': dtype('O')}

Another tip, useful if you chain together operations that could cause type conversions, is to call .astype with the output of df.dtypes.to_dict()

Example:

df = (
    df
    .some_type_changing_method()
    .astype(df.dtypes.to_dict())
)

This ensures that the dtypes at the end of the chained operations match those at the beginning; if a conversion is impossible (e.g. NaN to int), astype raises an error.
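
A runnable sketch of the tip above, in case it helps. The multiplication by 1.0 is only a stand-in I picked for "some type-changing method" (it upcasts the integer column to float64), and the column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"some_ints": [1, 2, 3], "some_floats": [0.5, 1.5, 2.5]})

# Remember the dtypes before the chain starts.
original_dtypes = df.dtypes.to_dict()

# Stand-in for a type-changing step: multiplying by 1.0 turns int64 into float64.
changed = df.mul(1.0)

# Casting with the saved mapping restores the original dtypes.
restored = changed.astype(original_dtypes)
```

If the intermediate step had introduced NaN into some_ints, the final astype call would raise instead of silently returning floats.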

Upvotes: 0

Sergej Steinhauer

Reputation: 31

With pandas version 1.5.3 it is possible to pass explicit data types:

import pandas as pd
data = (['Alex', 10],["Bob",12],["Clarke",11.05])
df = pd.DataFrame(data,columns=("Name", "Age"),dtype=(str, float))
print(df)

Upvotes: 3

user10983117

Reputation: 1

When working with data types, pass them as strings.

For example, the second method you tried should be changed to

mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': 'int'})

instead of

mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': int}).

The dtype (int, float etc.) should be given as strings.

Alternatively (if you don't want to pass strings), import numpy as np and use mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': np.int})

Upvotes: -3

user545424

Reputation: 16179

As of pandas version 0.24.2 (the current stable release) it is not possible to pass an explicit list of datatypes to the DataFrame constructor as the docs state:

dtype : dtype, default None

    Data type to force. Only a single dtype is allowed. If None, infer

However, the DataFrame class does have a class method, from_records, that converts a NumPy structured array to a DataFrame, so you can do:

>>> myarray = np.random.randint(0,5,size=(2,2))
>>> record = np.array(list(map(tuple, myarray)), dtype=[('a', np.float64), ('b', np.int64)])
>>> mydf = pd.DataFrame.from_records(record)
>>> mydf.dtypes
a    float64
b      int64
dtype: object

Upvotes: 11

DBCerigo

Reputation: 635

You may want to try passing a dictionary of Series objects to the DataFrame constructor. It gives you much more specific control over the creation and should make it clearer what's going on. A template version (data1 can be an array etc.):

df = pd.DataFrame({'column1':pd.Series(data1, dtype='type1'),
                   'column2':pd.Series(data2, dtype='type2')})

An example with data:

df = pd.DataFrame({'A':pd.Series([1,2,3], dtype='int'),
                   'B':pd.Series([7,8,9], dtype='float')})

print (df)
   A  B
0  1  7.0
1  2  8.0
2  3  9.0

print (df.dtypes)
A     int32
B    float64
dtype: object
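
The same idea can be driven by a list of column names and a parallel list of dtypes, which is roughly what the question asks for. This is only a sketch: the names columns and dtypes are my own, and it assumes the data is a 2-D NumPy array with one column per name:

```python
import numpy as np
import pandas as pd

myarray = np.random.randint(0, 5, size=(2, 2))
columns = ["a", "b"]
dtypes = [float, int]  # one dtype per column, in the same order as `columns`

# Build one Series per column, each with its own dtype.
mydf = pd.DataFrame({
    name: pd.Series(myarray[:, i], dtype=dtype)
    for i, (name, dtype) in enumerate(zip(columns, dtypes))
})

print(mydf.dtypes)
```

The exact integer width reported for column b depends on the platform and NumPy version, as in the output above.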

Upvotes: 14

mattexx

Reputation: 6606

I just ran into this, and the pandas issue is still open, so I'm posting my workaround. Assuming df is my DataFrame and dtype is a dict mapping column names to types:

for k, v in dtype.items():
    df[k] = df[k].astype(v)

(note: use dtype.iteritems() in python 2)
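
For comparison, here is a small self-contained sketch of the loop next to a single DataFrame.astype call with the same mapping. The sample data and the dtype dict are made up for illustration; on pandas versions where astype accepts a dict, both routes should give the same dtypes:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
dtype = {"a": "float64", "b": "int64"}  # column name -> target dtype

# Column-by-column cast, as in the workaround above.
looped = df.copy()
for k, v in dtype.items():
    looped[k] = looped[k].astype(v)

# The same mapping passed to astype in one call.
one_call = df.astype(dtype)

print(one_call.dtypes)
```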

Upvotes: 31
