Reputation: 8856
I'm trying to create an empty data frame with an index and specify the column types. The way I am doing it is the following:
df = pd.DataFrame(index=['pbp'],
                  dtype=['str', 'str', 'str', 'str',
                         'int', 'float', 'float',
                         'int', 'float'])
However, I get the following error,
TypeError: data type not understood
What does this mean?
Upvotes: 156
Views: 204729
Reputation: 361
My solution (without setting an index) is to initialize a dataframe with column names and then specify the data types using astype():
schema = {
    'contract': str,
    'state_and_county_code': str,
    'state': str,
    'county': str,
    'starting_membership': int,
    'starting_raw_raf': float,
    'enrollment_trend': float,
    'projected_membership': int,
    'projected_raf': float,
}
# Iterating over the dict yields its keys, which become the column names.
df = pd.DataFrame(columns=schema).astype(schema)
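A quick way to check the approach above is a minimal sketch with a shortened, hypothetical three-column schema. The explicit 'int64' and 'float64' strings are my own choice here, so that the result does not depend on the platform's default integer size.

```python
import pandas as pd

# Hypothetical, shortened schema used only for illustration.
schema = {
    'contract': str,
    'starting_membership': 'int64',
    'projected_raf': 'float64',
}

# list(schema) gives the keys, which become the column names;
# astype then applies one dtype per column.
df = pd.DataFrame(columns=list(schema)).astype(schema)

print(len(df))
print(df.dtypes)
```

The frame has no rows, and each column carries the dtype requested in the mapping.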
Upvotes: 25
Reputation: 1777
One could use a dataclass for easy maintenance, as follows:
import pandas as pd
from dataclasses import dataclass
@dataclass
class Contract:
    contract: str = 'contract'
    state_and_county_code: str = 'zip'
    state: str = 'state'
    county: str = 'county'
    starting_membership: float = 0.0
    starting_raw_raf: float = 0.0
    enrollment_trend: float = 0.0
    projected_membership: int = 0
    projected_raf: float = 0.0
    def empty(self):
        # Build a one-row frame from the field defaults, then slice the row away
        # so that only the column names and dtypes remain.
        empty_df = pd.DataFrame([self.__dict__]).iloc[0:0]
        return empty_df
To get an empty df, instantiate as follows:
empty_contract_df = Contract().empty()
Upvotes: 1
Reputation: 1393
You can use the following:
df = pd.DataFrame({'a': pd.Series(dtype='int'),
'b': pd.Series(dtype='str'),
'c': pd.Series(dtype='float')})
or more abstractly:
df = pd.DataFrame({c: pd.Series(dtype=t) for c, t in {'a': 'int', 'b': 'str', 'c': 'float'}.items()})
If you then display df, you have:
>>> df
Empty DataFrame
Columns: [a, b, c]
Index: []
and if you check its types:
>>> df.dtypes
a int32
b object
c float64
dtype: object
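The int32 in the output above is platform dependent: as far as I know, 'int' maps to the platform's default integer, which can be 32-bit on some setups (for example Windows with older NumPy). A minimal sketch, assuming you want the same result everywhere, is to spell out the widths explicitly:

```python
import pandas as pd

# Explicit widths avoid depending on the platform's default integer size.
df = pd.DataFrame({
    'a': pd.Series(dtype='int64'),
    'b': pd.Series(dtype='object'),
    'c': pd.Series(dtype='float64'),
})

print(df.dtypes)
```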
Upvotes: 132
Reputation: 5222
This is not a working solution, just a remark.
You can get around the TypeError by using np.dtype:
pd.DataFrame(index=['pbp'], columns=['a', 'b'], dtype=np.dtype([('str', 'float')]))
but you get this instead:
NotImplementedError: compound dtypes are not implemented in the DataFrame constructor
Upvotes: 13
Reputation: 10626
You can also build the columns from empty NumPy arrays directly:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'a': np.ndarray((0,), dtype=int),
     'b': np.ndarray((0,), dtype=str),
     'c': np.ndarray((0,), dtype=float)
    }
)
The resulting dtypes are:
a      int64
b     object
c    float64
dtype: object
This is also the fastest way of doing it, as the following timings show:
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: %timeit pd.DataFrame({'a': np.ndarray((0,), dtype=int), 'b': np.ndarray(
...: (0,), dtype=str), 'c': np.ndarray((0,), dtype=float)})
183 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: def df_empty(columns, dtypes, index=None):
   ...:     assert len(columns)==len(dtypes)
   ...:     df = pd.DataFrame(index=index)
   ...:     for c,d in zip(columns, dtypes):
   ...:         df[c] = pd.Series(dtype=d)
   ...:     return df
   ...: %timeit df_empty(['a', 'b', 'c'], dtypes=[int, str, float])
1.14 ms ± 2.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit pd.DataFrame({'a': pd.Series(dtype='int'), 'b': pd.Series(dtype='str'), 'c': pd.Series(dtype='float')})
564 µs ± 658 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 2
Reputation: 528
I recommend this:
columns = ["a", "b"]
types = ['float32', 'str']
predefined_size = 10
df = pd.DataFrame({c: pd.Series(index=range(predefined_size), dtype=t)
                   for c, t in zip(columns, types)})
Upvotes: 0
Reputation: 340
Create empty dataframe in Pandas specifying column types:
import pandas as pd
c1 = pd.Series(data=None, dtype='string', name='c1')
c2 = pd.Series(data=None, dtype='bool', name='c2')
c3 = pd.Series(data=None, dtype='float', name='c3')
c4 = pd.Series(data=None, dtype='int', name='c4')
df = pd.concat([c1, c2, c3, c4], axis=1)
df.info(verbose=True)
We create the columns as Series with the correct dtypes, then concatenate the Series into a DataFrame. The result is what a DataFrame constructor with per-column dtypes would give us:
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 0 non-null string
1 c2 0 non-null bool
2 c3 0 non-null float64
3 c4 0 non-null int32
dtypes: bool(1), float64(1), int32(1), string(1)
memory usage: 0.0+ bytes
Upvotes: 1
Reputation: 18978
One way to do it:
import numpy
import pandas
dtypes = numpy.dtype(
    [
        ("a", str),
        ("b", int),
        ("c", float),
        ("d", numpy.datetime64),
    ]
)
# An empty structured array carries one dtype per field; each field becomes a column.
df = pandas.DataFrame(numpy.empty(0, dtype=dtypes))
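A variant of the structured-array idea, as a sketch rather than a definitive recipe. It assumes you prefer fully specified dtypes: object for the string column, fixed-width numeric types, and an explicit nanosecond unit for the datetime column. These choices are mine, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Structured dtype with explicit widths and an explicit datetime unit (assumed choices).
dtypes = np.dtype([
    ("a", object),
    ("b", "int64"),
    ("c", "float64"),
    ("d", "datetime64[ns]"),
])

# A zero-length structured array yields a zero-row frame with one column per field.
df = pd.DataFrame(np.empty(0, dtype=dtypes))

print(df.dtypes)
```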
Upvotes: 41
Reputation: 563
Taking the lists columns and dtype from your example, you can do the following:
cdt={i[0]: i[1] for i in zip(columns, dtype)} # make column type dict
pdf=pd.DataFrame(columns=list(cdt)) # create empty dataframe
pdf=pdf.astype(cdt) # set desired column types
The DataFrame documentation says that only a single dtype is allowed in the constructor call.
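To illustrate the single-dtype restriction, here is a minimal sketch with hypothetical column names: one dtype for the whole frame goes through the constructor, while per-column dtypes need a second step such as astype with a mapping.

```python
import pandas as pd

columns = ['a', 'b', 'c']  # hypothetical column names

# A single dtype for the whole frame is accepted by the constructor.
df_single = pd.DataFrame(columns=columns, dtype='float64')

# Per-column dtypes are applied afterwards with a column-to-dtype mapping.
df_mixed = pd.DataFrame(columns=columns).astype(
    {'a': 'int64', 'b': 'float64', 'c': 'bool'}
)

print(df_single.dtypes)
print(df_mixed.dtypes)
```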
Upvotes: 5
Reputation: 1952
This is an old question, but I don't see a solid answer (although @eric_g was super close).
You just need to create an empty dataframe with a dictionary of key:value pairs. The key being your column name, and the value being an empty data type.
So in your example dataset, it would look as follows (pandas 0.25 and python 3.7):
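variables = {'contract': '',
             'state_and_county_code': '',
             'state': '',
             'county': '',
             'starting_membership': int(),
             'starting_raw_raf': float(),
             'enrollment_trend': float(),
             'projected_membership': int(),
             'projected_raf': float()}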
df = pd.DataFrame(variables, index=[])
In old pandas versions, one may have to do:
df = pd.DataFrame(columns=[variables])
Upvotes: 28
Reputation: 59731
I found the easiest workaround for me was to simply concatenate a list of empty series for each individual column:
import pandas as pd
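columns = ['contract', 'state_and_county_code', 'state', 'county',
           'starting_membership', 'starting_raw_raf', 'enrollment_trend',
           'projected_membership', 'projected_raf']
dtype = ['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float']
df = pd.concat([pd.Series(name=col, dtype=dt) for col, dt in zip(columns, dtype)], axis=1)
df.info()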
# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 9 columns):
# contract 0 non-null object
# state_and_county_code 0 non-null object
# state 0 non-null object
# county 0 non-null object
# starting_membership 0 non-null int32
# starting_raw_raf 0 non-null float64
# enrollment_trend 0 non-null float64
# projected_membership 0 non-null int32
# projected_raf 0 non-null float64
# dtypes: float64(3), int32(2), object(4)
# memory usage: 0.0+ bytes
Upvotes: 3
Reputation: 15810
This really smells like a bug.
Here's another (simpler) solution.
import pandas as pd
import numpy as np
def df_empty(columns, dtypes, index=None):
    assert len(columns)==len(dtypes)
    df = pd.DataFrame(index=index)
    for c,d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
print(list(df.dtypes)) # int64, int64
Upvotes: 28
Reputation: 642
You can do this by passing a dictionary into the DataFrame constructor:
df = pd.DataFrame(index=['pbp'],
                  data={'contract' : np.full(1, "", dtype=str),
                        'starting_membership' : np.full(1, np.nan, dtype=int),
                        'projected_membership' : np.full(1, np.nan, dtype=float)
                       })
This will correctly give you a dataframe that looks like:
contract projected_membership starting_membership
pbp "" NaN -9223372036854775808
With dtypes:
contract object
projected_membership float64
starting_membership int64
That said, there are two things to note:
1) str isn't actually a type that a DataFrame column can handle; instead it falls back to the general case, object. It'll still work properly.
2) Why don't you see NaN under starting_membership? Well, NaN is only defined for floats; there is no "None" value for integers, so it casts np.NaN to an integer. If you want a different default value, you can change that in the np.full call.
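Regarding the second point: newer pandas versions (0.24 and later, as far as I know) also ship a nullable integer extension dtype, spelled 'Int64' with a capital I. The following is only a sketch under that assumption; it is not part of the approach above, and it reuses the question's index label and one of its column names for illustration.

```python
import pandas as pd

# 'Int64' (capital I) is the nullable integer extension dtype,
# which can hold a missing value without falling back to float.
df = pd.DataFrame(
    {'starting_membership': pd.array([None], dtype='Int64')},
    index=['pbp'],
)

print(df.dtypes)
print(df['starting_membership'].isna())
```

With this dtype the placeholder row holds a missing value instead of a very large negative integer.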
Upvotes: 2
Reputation: 3936
I found this question after running into the same issue. I prefer the following solution (Python 3) for creating an empty DataFrame with no index.
import numpy as np
import pandas as pd
def make_empty_typed_df(dtype):
    tdict = np.typeDict
    types = tuple(tdict.get(t, t) for (_, t, *__) in dtype)
    if any(t == np.void for t in types):
        raise NotImplementedError('Not Implemented for columns of type "void"')
    return pd.DataFrame.from_records(np.array([tuple(t() for t in types)], dtype=dtype)).iloc[:0, :]
Testing this out ...
from itertools import chain
dtype = [('col%d' % i, t) for i, t in enumerate(chain(np.typeDict, set(np.typeDict.values())))]
dtype = [(c, t) for (c, t) in dtype if (np.typeDict.get(t, t) != np.void) and not isinstance(t, int)]
df = make_empty_typed_df(dtype)
print(df)
Empty DataFrame
Columns: [col0, col6, col16, col23, col24, col25, col26, col27, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col95, col96, col97, col98, col99, col100, col101, col102, col103, col104, col105, col106, col107, col108, col109, col110, col111, col112, col113, col114, col115, col117, col119, col120, col121, col122, col123, col124, ...]
Index: []
[0 rows x 146 columns]
And the datatypes ...
col0 timedelta64[ns]
col6 uint16
col16 uint64
col23 int8
col24 timedelta64[ns]
col25 bool
col26 complex64
col27 int64
col29 float64
col30 int8
col31 float16
col32 uint64
col33 uint8
col34 object
col35 complex128
col36 int64
col37 int16
col38 int32
col39 int32
col40 float16
col41 object
col42 uint64
col43 object
col44 int16
col45 object
col46 int64
col47 int16
col48 uint32
col49 object
col50 uint64
col144 int32
col145 bool
col146 float64
col147 datetime64[ns]
col148 object
col149 object
col150 complex128
col151 timedelta64[ns]
col152 int32
col153 uint8
col154 float64
col156 int64
col157 uint32
col158 object
col159 int8
col160 int32
col161 uint64
col162 int16
col163 uint32
col164 object
col165 datetime64[ns]
col166 float32
col167 bool
col168 float64
col169 complex128
col170 float16
col171 object
col172 uint16
col173 complex64
col174 complex128
dtype: object
Adding an index gets tricky because there isn't a true missing value for most data types, so they end up getting cast to some other type with a native missing value (e.g., ints are cast to floats or objects). But if you have complete data of the types you've specified, then you can always insert rows as needed, and your types will be respected. This can be accomplished with:
df.loc[index, :] = new_row
Again, as @Hun pointed out, this is NOT how Pandas is intended to be used.
Upvotes: 5
Reputation: 3857
A plain integer column in pandas can't hold missing values. You can either use a float column and convert it to integer as needed, or treat it as an object column. What you are trying to implement is not the way pandas is supposed to be used. But if you REALLY REALLY want that, you can get around the TypeError message by doing this:
df1 = pd.DataFrame(index=['pbp'], columns=['str1','str2','str2'], dtype=str)
df2 = pd.DataFrame(index=['pbp'], columns=['int1','int2'], dtype=int)
df3 = pd.DataFrame(index=['pbp'], columns=['flt1','flt2'], dtype=float)
df = pd.concat([df1, df2, df3], axis=1)
str1 str2 str2 int1 int2 flt1 flt2
pbp NaN NaN NaN NaN NaN NaN NaN
You can rearrange the col order as you like. But again, this is not the way pandas was supposed to be used.
str1 object
str2 object
str2 object
int1 object
int2 object
flt1 float64
flt2 float64
dtype: object
Note that int is treated as object.
Upvotes: 1