Oleksandr Zaitsev

Reputation: 41

What if I need to get columns with mixed types?

pandas: 0.23.4

According to the documentation, read_csv emits DtypeWarning: Columns (0) have mixed types when a column contains both int and str values, and it suggests setting low_memory=False, which removes the warning. But my task is the opposite: to identify the columns with mixed types!

At first I thought I could parse the DtypeWarning message to find out which columns have mixed types, but I ran into several problems that prevent me from relying on DtypeWarning:

  1. If I reduce the number of rows from 300,000 to 250,000, the DtypeWarning no longer appears, but I need it to work for at least 100,000 rows.
  2. Even with 300,000 rows, a mixed-type column is not always detected. For example, here I modify the DataFrame from the docs:

From the docs:

import pandas as pd

df = pd.DataFrame({'a': (['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000), 'b': ['b'] * 300000})
df.to_csv('test.csv', index=False)
df2 = pd.read_csv('test.csv')
# DtypeWarning: Columns (0) have mixed types

My case:

df = pd.DataFrame({'a': ([1] * 10000 + ['X'] * 10000 + [1] * 10000) * 10, 'b': ['b'] * 300000})
df.to_csv('test.csv', index=False)
df2 = pd.read_csv('test.csv')
# No warning

The column still has mixed types, but the warning doesn't appear. And if I inspect the types of the values, all of them are str. I.e., I can't detect the mixed types even by myself.
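For reference, one way to inspect the value types in a column is to count the Python type of each element. This is only a minimal sketch: the helper name value_type_counts and the small in-memory Series are made up for illustration and do not come from the original post.

```python
import pandas as pd

def value_type_counts(series):
    """Count the Python types of the non-missing values in a Series.

    Hypothetical helper, not part of pandas.
    """
    type_names = series.dropna().map(lambda v: type(v).__name__)
    return type_names.value_counts().to_dict()

# Illustrative in-memory column that really holds both ints and strings.
mixed = pd.Series([1, "X", 1], dtype=object)
print(value_type_counts(mixed))
```

Run against a column that read_csv has already turned into strings, the same helper would report a single type, which is the situation described above.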

So, how can I get the columns with mixed types? Is it possible to add a parameter such as read_csv(mixed_types=True) to force pandas not to hide mixed types, for all datasets or at least for 100,000 rows? Or are there any other ideas?

Thanks.

Summary

It seems that pandas does not provide a way to find out which columns have mixed types; on the contrary, it hides mixed types behind the object dtype with str values inside. DtypeWarning is an exception to the rule. The link from @pygo's answer explains the apparent randomness of the DtypeWarning.
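A possible workaround, sketched here under assumptions rather than taken from pandas itself, is to skip type inference entirely by reading every cell as text with dtype=str, and then test each column for a mix of numeric-looking and non-numeric values. The function name find_mixed_columns and the sample CSV text are hypothetical, and "mixed" is assumed to mean "some values parse as numbers and some do not".

```python
import io

import pandas as pd

def find_mixed_columns(source):
    """Return names of columns that hold both numeric-looking and other values.

    Hypothetical helper: 'source' is anything pd.read_csv accepts.
    """
    # dtype=str turns off type inference, so every cell arrives as text.
    raw = pd.read_csv(source, dtype=str)
    mixed = []
    for name in raw.columns:
        values = raw[name].dropna().astype(object)
        if values.empty:
            continue
        # Values that cannot be parsed as numbers become NaN.
        is_numeric = pd.to_numeric(values, errors="coerce").notna()
        if is_numeric.any() and not is_numeric.all():
            mixed.append(name)
    return mixed

# Small stand-in for test.csv: column 'a' mixes 1 and X, column 'b' is all text.
csv_text = "a,b\n1,b\nX,b\n1,b\n"
print(find_mixed_columns(io.StringIO(csv_text)))
```

This checks every row, so the result does not depend on how the parser happens to chunk the file, at the cost of holding the whole file in memory as strings.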

Upvotes: 1

Views: 1863

Answers (1)

Karn Kumar

Reputation: 8816

It should work for both rows and columns.

low_memory : boolean, default True

Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)

This is from the GitHub page:

It is deterministic - types are consistently inferred based on what's in the data. That said, the internal chunksize is not a fixed number of rows, but instead bytes, so whether you get a mixed dtype warning or not can feel a bit random.
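Because the internal chunks are measured in bytes, one way to make the behaviour predictable is to pass an explicit, row-based chunksize to read_csv and compare the dtype inferred for each column across chunks. The following is a rough sketch, not pandas' own mechanism: the helper names and the tiny CSV text are invented for illustration. It only catches columns whose inferred dtype differs between chunks; a column that is mixed inside every chunk is inferred as the same text dtype each time and is missed, and a column that is int64 in one chunk and float64 in another (for example because of missing values) is reported as well.

```python
import io

import pandas as pd

def dtypes_per_chunk(source, chunksize):
    """Map each column name to the set of dtypes inferred across row-based chunks."""
    seen = {}
    for chunk in pd.read_csv(source, chunksize=chunksize):
        for name, dtype in chunk.dtypes.items():
            seen.setdefault(name, set()).add(str(dtype))
    return seen

def columns_with_varying_dtype(source, chunksize=10_000):
    """Return columns whose inferred dtype is not the same in every chunk."""
    per_chunk = dtypes_per_chunk(source, chunksize)
    return [name for name, kinds in per_chunk.items() if len(kinds) > 1]

# Scaled-down version of the question's data: blocks of 1, then X, then 1.
csv_text = "a,b\n" + "1,b\n" * 5 + "X,b\n" * 5 + "1,b\n" * 5
print(columns_with_varying_dtype(io.StringIO(csv_text), chunksize=5))
```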

I think you should not worry about this message, as the warning is generic.

Or specify the dtype of each column explicitly (substituting your own column names):

df2 = pd.read_csv('test.csv', engine='c', dtype={'FULL': 'str', 'COUNT': 'int'}, header=1)

Upvotes: 2
