Reputation: 41
According to the documentation, we get DtypeWarning: Columns (0) have mixed types if the data contains both int and str, and it is suggested to set low_memory=False, which removes the warning. But my task is the opposite: to identify the columns with mixed types!
At first, I thought I could parse the DtypeWarning message to find out which columns have mixed types, but I ran into several difficulties that prevent me from relying on DtypeWarning:
import pandas as pd

df = pd.DataFrame({'a': (['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000), 'b': ['b'] * 300000})
df.to_csv('test.csv', index=False)
df2 = pd.read_csv('test.csv')
# DtypeWarning: Columns (0) have mixed types
df = pd.DataFrame({'a': ([1] * 10000 + ['X'] * 10000 + [1] * 10000) * 10, 'b': ['b'] * 300000})
df.to_csv('test.csv', index=False)
df2 = pd.read_csv('test.csv')
# No warning
The column still has mixed types, but the warning does not appear. And if I inspect the types of the values, all of them are str, i.e. I can't detect mixed types even by myself.
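For reference, here is a minimal sketch of what "inspecting the types" can look like. The helper name python_types_per_column is hypothetical (it is not part of pandas), and the small in-memory frame stands in for the data read from test.csv:

```python
import pandas as pd


def python_types_per_column(df):
    """Return {column name: set of Python type names found among its values}.

    Hypothetical helper for illustration only; it is not a pandas API.
    """
    return {
        col: {type(value).__name__ for value in df[col]}
        for col in df.columns
    }


# A small frame built in memory keeps int and str objects side by side.
small = pd.DataFrame({'a': [1, 'X', 1], 'b': ['b', 'b', 'b']})
print(python_types_per_column(small))
```

On a frame built in memory this reports both int and str for column a. The frame returned by read_csv in the second case above shows only str, which is the problem described here.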
So, how can I get the columns with mixed types? Is it possible to add a parameter such as read_csv(mixed_types=True) to force pandas not to hide mixed types, for all datasets or for at least 100 000 rows? Or are there any other ideas?
Thanks.
It seems that pandas provides no way to find out which columns have mixed types; on the contrary, it hides mixed types behind the object dtype with str values inside. DtypeWarning is an exception to the rule. The link in @pygo's answer explains the apparent randomness of the DtypeWarning.
Upvotes: 1
Views: 1863
Reputation: 8816
It should work for both rows and columns.
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
This is from the GitHub page:
It is deterministic - types are consistently inferred based on what's in the data. That said, the internal chunksize is not a fixed number of rows, but instead bytes, so whether you get a mixed dtype warning or not can feel a bit random.
I think you should not worry about these messages, as the warning is generic. If you know the expected types, you can specify them explicitly with the dtype parameter:
df2 = pd.read_csv('test.csv', engine='c', dtype={'FULL': 'str', 'COUNT': 'int'}, header=1)
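If the goal is to find the mixed columns rather than silence the warning, one possible approach is to read everything as text and then check, column by column, whether only some of the values parse as numbers. This is just a sketch under assumptions: find_mixed_columns is a hypothetical helper, not a pandas feature, and "mixed" here means "some non-empty values are numeric and some are not":

```python
import io

import pandas as pd


def find_mixed_columns(csv_source):
    """Return the names of columns that mix numeric and non-numeric text.

    Hypothetical helper for illustration; csv_source is a path or a
    file-like object accepted by pd.read_csv.
    """
    # Read every column as text so that pandas performs no type inference.
    df = pd.read_csv(csv_source, dtype=str)
    mixed = []
    for col in df.columns:
        values = df[col].dropna()  # empty cells are ignored
        # True where the text parses as a number, False otherwise.
        is_numeric = pd.to_numeric(values, errors='coerce').notna()
        if is_numeric.any() and not is_numeric.all():
            mixed.append(col)
    return mixed


# Small stand-in for test.csv: column a mixes numbers with the text 'X'.
sample = io.StringIO("a,b\n1,b\nX,b\n1,b\n")
print(find_mixed_columns(sample))
```

Because the file is read with dtype=str, the result does not depend on how the parser splits the file into chunks. Note that literal text such as "nan" is treated as non-numeric by this check.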
Upvotes: 2