physiker

Reputation: 919

pandas memory error while converting column of large csv from string to float

I have a large csv (~20 mil rows) and I'd like to convert one column from string to float. I do it this way:

df['sale']=df['sale'].str.replace(",", ".").astype('float32')

and sale looks like:

86,2600
20,2800 
123,5000
30,7500
8,3600

The command seems unstable, i.e. it sometimes raises the following memory error:

MemoryError                               Traceback (most recent call last)
in ()
----> 1 df['sale']=df['sale'].str.replace(",", ".").astype('float32');

What is exactly this error and how can I fix it? Thank you!

Upvotes: 1

Views: 616

Answers (1)

EdChum

Reputation: 394199

Rather than converting after loading, which is a memory-intensive operation, you can tell pandas that the decimal separator is European style by passing the param decimal=',' to read_csv:

pd.read_csv(FILENAME, decimal=',')

Example:

In[24]:
import io
import pandas as pd

t = """data
86,2600
20,2800
123,5000
30,7500
8,3600"""
df = pd.read_csv(io.StringIO(t), decimal=',', sep=';')
df

Out[24]: 
     data
0   86.26
1   20.28
2  123.50
3   30.75
4    8.36

Note that I pass sep=';', otherwise it would split each row into 2 columns, since the default separator is a comma.
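To illustrate the failure mode, here is a minimal sketch of what happens if you read the same data with the default separator: each row splits on the comma into two fields, and because there is one more data field than header fields, pandas uses the first field as the row index.

```python
import io
import pandas as pd

t = """data
86,2600
20,2800
123,5000
30,7500
8,3600"""

# Default sep=',' splits each row into two fields; with one more data
# field than header fields, pandas treats the first field as the index.
df_bad = pd.read_csv(io.StringIO(t))
print(df_bad['data'].tolist())   # only the fractional parts, as integers
print(list(df_bad.index))        # the integer parts, used as row labels
```

So both halves of each number survive, but as two separate integer sequences rather than one float column.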

The output shows that the values were parsed as decimals, and we can confirm the dtype using .info():

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
data    5 non-null float64
dtypes: float64(1)
memory usage: 120.0 bytes
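Since the question's original goal was float32, it should also be possible to combine decimal=',' with read_csv's dtype parameter so the column is parsed directly to the narrower type, roughly halving its memory footprint and avoiding a separate post-load conversion pass. A minimal sketch, assuming the column is named sale as in the question:

```python
import io
import pandas as pd

t = """sale
86,2600
20,2800
123,5000
30,7500
8,3600"""

# decimal=',' parses the European-style decimals; dtype casts the column
# to float32 at read time instead of in a second memory-hungry pass.
df = pd.read_csv(io.StringIO(t), decimal=',', sep=';',
                 dtype={'sale': 'float32'})
print(df['sale'].dtype)  # float32
```

On a 20-million-row file this cuts the column from ~160 MB (float64) to ~80 MB.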

Upvotes: 2
