almonde
almonde

Reputation: 111

read_csv in Pandas while casting datatypes from a dictionary

I am reading a large CSV into a dataframe. Pandas is setting the dtype to float64 which is using a significant amount of memeory (2.4 GB for a 300MB CSV) I would like to set some of the dtypes to Int during import. I have a lot of NaNs in the dataset, so am trying to leverage the UInt dtype in Pandas 0.24.

Initially i am reading the CSV to extract just the header and row #3 which contains the source data dtypes. Its in the form of U1, U2 and F4, so i use df.replace to change them to the pandas dtypes, UInt8, UInt16 and float64 respectively. I then create a dictionary from this dataframe. I try to use that dictionary to define the dtype when i read the CSV, but i always get: Exception has occurred: TypeError data type not understood

import pandas as pd 
import numpy as np 

filename = r'input/TST_short.csv'
dtype_df = pd.read_csv(filename, nrows=1, skiprows=[1,2])

dtype_df.replace('U1', 'UInt8', inplace=True)
dtype_df.replace('U2', 'UInt16',inplace=True)
dtype_df.replace('F4', 'float64',inplace=True)

dtype_dict = dtype_df.to_dict('records')

df = pd.read_csv(filename, skiprows=range(1,5), nrows=500, dtype=dtype_dict)
print(df)

if I remove the dtype= argument the df looks like this:

           TIME     ST1F  ST2F  ST3F  ST4F    P12V
0    43476.34869  32768.0   1.0   0.0   0.0  12.161
1    43476.34870      NaN   NaN   NaN   NaN     NaN
2    43476.34872      NaN   NaN   NaN   NaN     NaN
3    43476.34873      NaN   NaN   NaN   NaN     NaN
4    43476.34874      NaN   NaN   NaN   NaN     NaN
5    43476.34875      NaN   NaN   NaN   NaN     NaN
6    43476.34876      NaN   NaN   NaN   NaN     NaN
7    43476.34877      NaN   NaN   NaN   NaN     NaN
8    43476.34878      NaN   NaN   NaN   NaN     NaN
9    43476.34880      NaN   NaN   NaN   NaN     NaN
10   43476.34881  32768.0   1.0   0.0   0.0     NaN
11   43476.34882      NaN   NaN   NaN   NaN  12.161
12   43476.34883      NaN   NaN   NaN   NaN     NaN
13   43476.34884      NaN   NaN   NaN   NaN     NaN 

Everything is cast as float64

The dictionary (dtype_dict) looks like this:

{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}

Go easy on me if i haven't provided enough info, its my first post and I am new to this :)

Update: Thanks for the help, it seems my dictionary is not quite formatted as it should be. I've tried all the dictionary types, I thought records was the closest but when i compare i can see i have square braces encloses the dictionary from the df.

d = dtype_df.to_dict('records')
d1 = {
    'TIME': 'float64',
    'ST1F': 'UInt16',
    'ST2F': 'UInt8',
    'ST3F': 'UInt8',
    'ST4F': 'UInt8',
    'P12V': 'float64'}

Output:

[{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}]
{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}

The dataframe i am generating the dictionary is in this form:

      TIME    ST1F   ST2F   ST3F   ST4F     P12V
0  float64  UInt16  UInt8  UInt8  UInt8  float64

Upvotes: 0

Views: 1773

Answers (2)

almonde
almonde

Reputation: 111

The issue was, I was outputting my dictionary as a list containing a dictionary, there might be a more elegant way to get around this, but i fixed it by passing the index[0] for the list containing the dictionary.

df = pd.read_csv(filename, skiprows=range(1,5), dtype=d[0])

EDIT: worked it out then saw a comment posted earlier with the answer :)

Upvotes: 0

fuglede
fuglede

Reputation: 18201

Your code works exactly as expected here. Are you sure you are using 0.24.0 or 0.24.1?

In [27]: pd.read_csv('test.csv').head()  # Spits out floats
Out[27]:
          TIME     ST1F  ST2F  ST3F  ST4F    P12V
0  43476.34869  32768.0   1.0   0.0   0.0  12.161
1  43476.34870      NaN   NaN   NaN   NaN     NaN
2  43476.34872      NaN   NaN   NaN   NaN     NaN
3  43476.34873      NaN   NaN   NaN   NaN     NaN
4  43476.34874      NaN   NaN   NaN   NaN     NaN

In [28]: pd.read_csv('test.csv', dtype=d).head()  # Has the proper types
Out[28]:
          TIME   ST1F  ST2F  ST3F  ST4F    P12V
0  43476.34869  32768     1     0     0  12.161
1  43476.34870    NaN   NaN   NaN   NaN     NaN
2  43476.34872    NaN   NaN   NaN   NaN     NaN
3  43476.34873    NaN   NaN   NaN   NaN     NaN
4  43476.34874    NaN   NaN   NaN   NaN     NaN

In [29]: d  # Let's check that this dict matches yours on the nose
Out[29]:
{'TIME': 'float64',
 'ST1F': 'UInt16',
 'ST2F': 'UInt8',
 'ST3F': 'UInt8',
 'ST4F': 'UInt8',
 'P12V': 'float64'}

In [30]: pd.read_csv('test.csv', dtype=d).info()  # And the types look good as well
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 6 columns):
TIME    14 non-null float64
ST1F    2 non-null UInt16
ST2F    2 non-null UInt8
ST3F    2 non-null UInt8
ST4F    2 non-null UInt8
P12V    2 non-null float64
dtypes: UInt16(1), UInt8(3), float64(2)
memory usage: 430.0 bytes

In [32]: pd.__version__
Out[32]: '0.24.0'

Upvotes: 2

Related Questions