Fabian

Reputation: 1020

How to force pandas read_csv to use float32 for all float columns?


Note that not all columns in the raw csv file have float types. I only need to set float32 as the default for float columns.

Upvotes: 26

Views: 18282

Answers (5)

asbrana

Reputation: 1

I think it's slightly more efficient to go through df.dtypes directly, as opposed to jorijnsmit's select_dtypes solution:

jorijnsmit's:

%%timeit
df.astype({c: 'float32' for c in df.select_dtypes(include='float64').columns})
754 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

calling dtypes:

%%timeit
df.astype({c: 'float32' for c in df.dtypes.index[df.dtypes == 'float64']})
538 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
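Both comprehensions build the same dtype mapping, so the choice is purely about speed. A quick sanity check on a small synthetic frame (column names are illustrative):

```python
import pandas as pd

# Small illustrative frame with mixed dtypes.
df = pd.DataFrame({
    "a": [1.0, 2.0],   # float64
    "b": [1, 2],       # int64
    "c": ["x", "y"],   # object
})

# The two ways of building the float64 -> float32 mapping.
via_select = {c: "float32" for c in df.select_dtypes(include="float64").columns}
via_dtypes = {c: "float32" for c in df.dtypes.index[df.dtypes == "float64"]}

assert via_select == via_dtypes == {"a": "float32"}
assert df.astype(via_dtypes)["a"].dtype == "float32"
```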

Upvotes: 0

gosuto

Reputation: 5741

Here's a solution which neither depends on .join nor requires reading the file twice:

import numpy as np

float64_cols = df.select_dtypes(include='float64').columns
mapper = {col_name: np.float32 for col_name in float64_cols}
df = df.astype(mapper)

Or for kicks as a one-liner:

df = df.astype({c: np.float32 for c in df.select_dtypes(include='float64').columns})
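Downcasting halves the per-element memory of the float columns while leaving other dtypes untouched; a quick before/after check (the data here is illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative frame: two float64 columns and one int64 column.
df = pd.DataFrame({"x": np.random.rand(1000),
                   "y": np.random.rand(1000),
                   "n": np.arange(1000)})

before = df.memory_usage(deep=True).sum()
df = df.astype({c: np.float32 for c in df.select_dtypes(include="float64").columns})
after = df.memory_usage(deep=True).sum()

assert all(df[c].dtype == np.float32 for c in ("x", "y"))
assert df["n"].dtype == np.int64   # non-float column untouched
assert after < before              # float columns now take half the bytes
```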

Upvotes: 7

Janosh

Reputation: 4662

If you don't care about column order, there's also df.select_dtypes which avoids having to read_csv twice:

import pandas as pd

df = pd.read_csv("file.csv")

df_float = df.select_dtypes(include=float).astype("float32")
df_not_float = df.select_dtypes(exclude=float)

df = df_float.join(df_not_float)

Or, if you want to convert all non-string columns (e.g. integer columns) to float:

import pandas as pd

df = pd.read_csv("file.csv")

df_not_str = df.select_dtypes(exclude=object).astype("float32")
df_str = df.select_dtypes(include=object)

df = df_not_str.join(df_str)
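If column order does matter, it can be restored by recording the columns before the split and reindexing after the join. A sketch with an illustrative frame standing in for the CSV:

```python
import pandas as pd

# Illustrative frame; in practice this would come from pd.read_csv.
df = pd.DataFrame({"s": ["a", "b"], "f": [1.5, 2.5], "i": [1, 2]})
original_cols = df.columns

df_float = df.select_dtypes(include=float).astype("float32")
df_not_float = df.select_dtypes(exclude=float)

# .join puts the float columns first; reindex to restore the original order.
df = df_float.join(df_not_float)[original_cols]

assert list(df.columns) == ["s", "f", "i"]
assert df["f"].dtype == "float32"
```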

Upvotes: 1

Bstampe

Reputation: 749

@Alexander's is a great answer. Some columns may need to stay precise, though. If so, you can add more conditionals to the list comprehension to exclude those columns; the any and all built-ins are handy:

float_cols = [c for c in df_test
              if all([df_test[c].dtype == "float64",
                      not df_test[c].name == 'Latitude',
                      not df_test[c].name == 'Longitude'])]
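The exclusion can be verified on a toy frame (Latitude/Longitude come from the answer above; the other columns are illustrative):

```python
import pandas as pd

df_test = pd.DataFrame({
    "Latitude": [51.5, 48.9],     # float64, but must stay precise
    "Longitude": [-0.1, 2.4],     # float64, but must stay precise
    "temp": [12.3, 15.8],         # float64, safe to downcast
    "city": ["London", "Paris"],  # object
})

float_cols = [c for c in df_test
              if all([df_test[c].dtype == "float64",
                      not df_test[c].name == 'Latitude',
                      not df_test[c].name == 'Longitude'])]

assert float_cols == ["temp"]
```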

Upvotes: 1

Alexander

Reputation: 109546

Try:

import numpy as np
import pandas as pd

# Sample 100 rows of data to determine dtypes.
df_test = pd.read_csv(filename, nrows=100)

float_cols = [c for c in df_test if df_test[c].dtype == "float64"]
float32_cols = {c: np.float32 for c in float_cols}

df = pd.read_csv(filename, engine='c', dtype=float32_cols)

This first reads a sample of 100 rows of data (modify as required) to determine the type of each column.

It then creates a list of the columns whose dtype is 'float64', and uses a dictionary comprehension to build a dictionary with these columns as the keys and np.float32 as the value for each key.

Finally, it reads the whole file using the 'c' engine (required for assigning dtypes to columns) and passes the float32_cols dictionary to the dtype parameter.

df = pd.read_csv(filename, nrows=100)
>>> df
   int_col  float1 string_col  float2
0        1     1.2          a     2.2
1        2     1.3          b     3.3
2        3     1.4          c     4.4

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
int_col       3 non-null int64
float1        3 non-null float64
string_col    3 non-null object
float2        3 non-null float64
dtypes: float64(2), int64(1), object(1)

df32 = pd.read_csv(filename, engine='c', dtype={c: np.float32 for c in float_cols})
>>> df32.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
int_col       3 non-null int64
float1        3 non-null float32
string_col    3 non-null object
float2        3 non-null float32
dtypes: float32(2), int64(1), object(1)

Upvotes: 26
