Reputation: 111
I am reading a large CSV into a dataframe. Pandas is setting the dtype to float64 which is using a significant amount of memeory (2.4 GB for a 300MB CSV) I would like to set some of the dtypes to Int during import. I have a lot of NaNs in the dataset, so am trying to leverage the UInt dtype in Pandas 0.24.
Initially i am reading the CSV to extract just the header and row #3 which contains the source data dtypes. Its in the form of U1, U2 and F4, so i use df.replace to change them to the pandas dtypes, UInt8, UInt16 and float64 respectively. I then create a dictionary from this dataframe. I try to use that dictionary to define the dtype when i read the CSV, but i always get: Exception has occurred: TypeError data type not understood
import pandas as pd
import numpy as np
filename = r'input/TST_short.csv'
dtype_df = pd.read_csv(filename, nrows=1, skiprows=[1,2])
dtype_df.replace('U1', 'UInt8', inplace=True)
dtype_df.replace('U2', 'UInt16',inplace=True)
dtype_df.replace('F4', 'float64',inplace=True)
dtype_dict = dtype_df.to_dict('records')
df = pd.read_csv(filename, skiprows=range(1,5), nrows=500, dtype=dtype_dict)
print(df)
if I remove the dtype= argument the df looks like this:
TIME ST1F ST2F ST3F ST4F P12V
0 43476.34869 32768.0 1.0 0.0 0.0 12.161
1 43476.34870 NaN NaN NaN NaN NaN
2 43476.34872 NaN NaN NaN NaN NaN
3 43476.34873 NaN NaN NaN NaN NaN
4 43476.34874 NaN NaN NaN NaN NaN
5 43476.34875 NaN NaN NaN NaN NaN
6 43476.34876 NaN NaN NaN NaN NaN
7 43476.34877 NaN NaN NaN NaN NaN
8 43476.34878 NaN NaN NaN NaN NaN
9 43476.34880 NaN NaN NaN NaN NaN
10 43476.34881 32768.0 1.0 0.0 0.0 NaN
11 43476.34882 NaN NaN NaN NaN 12.161
12 43476.34883 NaN NaN NaN NaN NaN
13 43476.34884 NaN NaN NaN NaN NaN
Everything is cast as float64
The dictionary (dtype_dict) looks like this:
{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}
Go easy on me if i haven't provided enough info, its my first post and I am new to this :)
Update: Thanks for the help, it seems my dictionary is not quite formatted as it should be. I've tried all the dictionary types, I thought records was the closest but when i compare i can see i have square braces encloses the dictionary from the df.
d = dtype_df.to_dict('records')
d1 = {
'TIME': 'float64',
'ST1F': 'UInt16',
'ST2F': 'UInt8',
'ST3F': 'UInt8',
'ST4F': 'UInt8',
'P12V': 'float64'}
Output:
[{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}]
{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}
The dataframe i am generating the dictionary is in this form:
TIME ST1F ST2F ST3F ST4F P12V
0 float64 UInt16 UInt8 UInt8 UInt8 float64
Upvotes: 0
Views: 1773
Reputation: 111
The issue was, I was outputting my dictionary as a list containing a dictionary, there might be a more elegant way to get around this, but i fixed it by passing the index[0] for the list containing the dictionary.
df = pd.read_csv(filename, skiprows=range(1,5), dtype=d[0])
EDIT: worked it out then saw a comment posted earlier with the answer :)
Upvotes: 0
Reputation: 18201
Your code works exactly as expected here. Are you sure you are using 0.24.0 or 0.24.1?
In [27]: pd.read_csv('test.csv').head() # Spits out floats
Out[27]:
TIME ST1F ST2F ST3F ST4F P12V
0 43476.34869 32768.0 1.0 0.0 0.0 12.161
1 43476.34870 NaN NaN NaN NaN NaN
2 43476.34872 NaN NaN NaN NaN NaN
3 43476.34873 NaN NaN NaN NaN NaN
4 43476.34874 NaN NaN NaN NaN NaN
In [28]: pd.read_csv('test.csv', dtype=d).head() # Has the proper types
Out[28]:
TIME ST1F ST2F ST3F ST4F P12V
0 43476.34869 32768 1 0 0 12.161
1 43476.34870 NaN NaN NaN NaN NaN
2 43476.34872 NaN NaN NaN NaN NaN
3 43476.34873 NaN NaN NaN NaN NaN
4 43476.34874 NaN NaN NaN NaN NaN
In [29]: d # Let's check that this dict matches yours on the nose
Out[29]:
{'TIME': 'float64',
'ST1F': 'UInt16',
'ST2F': 'UInt8',
'ST3F': 'UInt8',
'ST4F': 'UInt8',
'P12V': 'float64'}
In [30]: pd.read_csv('test.csv', dtype=d).info() # And the types look good as well
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 6 columns):
TIME 14 non-null float64
ST1F 2 non-null UInt16
ST2F 2 non-null UInt8
ST3F 2 non-null UInt8
ST4F 2 non-null UInt8
P12V 2 non-null float64
dtypes: UInt16(1), UInt8(3), float64(2)
memory usage: 430.0 bytes
In [32]: pd.__version__
Out[32]: '0.24.0'
Upvotes: 2