Reputation: 433
I've run into a small issue when using a pandas DataFrame:
I have a big CSV file (around 2 GB of data) containing the price of an asset, created using the DataFrame.to_csv()
function of pandas. When I take a closer look at the file, the first lines look like this:
DateTime,open,high,low,close
2016-01-04 00:36:18,1.08505,1.08505,1.08504,1.08504
2016-01-04 00:36:19,1.08505,1.08505,1.08504,1.08504
2016-01-04 00:36:20,1.08503,1.08503,1.08495,1.08495
2016-01-04 00:36:21,1.0849600000000001,1.0849600000000001,1.0849600000000001,1.0849600000000001
2016-01-04 00:36:22,1.0849600000000001,1.0849600000000001,1.08492,1.08492
The data was created using df.resample('1s').ohlc()
and I thought there were some rounding issues, so I tried to round the DataFrame using df.round(5)
to keep only 5 decimal places, but it doesn't change anything at all.
SEC = pd.read_csv(r"D:\Finance python\Data\EUR_USD\Sec\S1_2015.csv",index_col='DateTime',parse_dates=True,error_bad_lines=False,infer_datetime_format=True)
SEC = SEC.round(5)
The DataFrame stays the same, and I wonder why.
When I try it with a CSV file containing the 5 rows I gave above:
In[13]: SEC["open"][3]
Out[13]: 1.0849599999999999
It's not an issue when doing calculations on the DataFrame (even though fewer decimals might be faster), but it seems that a lot of trailing 0s or 9s are being stored in my CSV files for nothing and are taking up extra space.
It also seems that even values that look fine in the CSV file are not actually rounded once loaded with pandas.
Would anyone have an idea of why the DataFrame is not being rounded properly, or of a way to get shorter CSV files when I save them with pandas?
Thanks in advance
Edit: I tried using Decimal, but it still doesn't work. I believe this is because pandas cannot store Decimal values natively in a DataFrame and converts them to float.
Upvotes: 2
Views: 7350
Reputation: 433
I found what the issue was in another post: float64 with pandas to_csv
I need to pass the argument float_format='%.5f' to to_csv()
to get my CSV files the way I want them. The issue is linked to the way floating-point numbers work.
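For illustration, here is a minimal sketch of that fix. It assumes a small frame built from values in the sample rows above, written to an in-memory buffer rather than the real 2 GB file; the frame contents and the names df, buf and csv_text are only placeholders.

```python
import io

import pandas as pd

# Illustrative frame using values from the question's sample rows.
df = pd.DataFrame(
    {"open": [1.08505, 1.0849600000000001], "close": [1.08504, 1.08492]},
    index=pd.to_datetime(["2016-01-04 00:36:18", "2016-01-04 00:36:21"]),
)
df.index.name = "DateTime"

# float_format controls how each float is written as text;
# the float64 values held in df are left untouched.
buf = io.StringIO()
df.to_csv(buf, float_format="%.5f")
csv_text = buf.getvalue()
print(csv_text)
```

Since float_format only affects the text that is written, the values read back with read_csv are still ordinary float64 numbers.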
Upvotes: 3
Reputation: 3130
This has to do with the precision of floating-point arithmetic; not all numbers can be represented exactly. If you want pandas to display numbers to 5 decimal places, you can do
pd.set_option('display.float_format', '{:.5f}'.format)
but the internal representation will stay the same (which, after reading your post more closely, will not solve your issue with the CSV).
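If you want to change the internal representation, you need to use a Decimal
data type. As far as I know, astype(Decimal) only gives you an object column that still holds floats, so convert each value through its string form instead:
from decimal import Decimal
df.round(5).astype(str).apply(lambda col: col.map(Decimal))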
FYI, you can reproduce your problem like this:
import numpy as np
np.float64(1.08496)
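The same effect can be seen with the standard library alone. This is a rough sketch rather than anything pandas-specific; the variable names x, exact and short are only illustrative.

```python
from decimal import Decimal

x = 1.08496

# Decimal(float) exposes the exact binary value stored for the literal,
# which is close to, but not exactly, 1.08496.
exact = Decimal(x)
print(exact)
print(exact == Decimal("1.08496"))  # False

# Rounding to 5 places hands back the same double, because x is already
# the closest float64 to 1.08496. This is why round(5) seems to do nothing.
print(round(x, 5) == x)  # True

# 17 significant digits reveal the digits past the fifth decimal...
print(format(x, ".17g"))
# ...while fixed 5-decimal formatting gives the short text form.
short = format(x, ".5f")
print(short)  # 1.08496
```

So the rounding does happen; the extra digits only come from how the stored double is turned back into text.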
Upvotes: 3
Reputation: 651
I tried running your code:
df = pd.read_clipboard(sep=',',engine='python')
df
DateTime open high low close
0 2016-01-04 00:36:18 1.08505 1.08505 1.08504 1.08504
1 2016-01-04 00:36:19 1.08505 1.08505 1.08504 1.08504
2 2016-01-04 00:36:20 1.08503 1.08503 1.08495 1.08495
and then used df.round(3):
DateTime open high low close
0 2016-01-04 00:36:18 1.085 1.085 1.085 1.085
1 2016-01-04 00:36:19 1.085 1.085 1.085 1.085
2 2016-01-04 00:36:20 1.085 1.085 1.085 1.085
It works for me, but if I call df.astype(str).round(3)
instead, the rounding no longer works. So I recommend checking the dtype
of your data.
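A quick way to do that check, as a sketch: the frame below is made up and deliberately holds strings to mimic the astype(str) case, and pd.to_numeric is just one possible way to get back to floats.

```python
import pandas as pd

# Made-up frame whose values are strings, as they would be after astype(str).
df = pd.DataFrame({"open": ["1.08505", "1.0849600000000001"]})

# A non-numeric dtype here means round() has nothing numeric to work on.
print(df.dtypes)

# Convert each column to a numeric dtype, then round.
numeric = df.apply(pd.to_numeric)
rounded = numeric.round(3)
print(numeric.dtypes)
print(rounded)
```

If df.dtypes already shows float64 for the price columns, the dtype is not the problem and the extra digits come from the float representation itself.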
Upvotes: 0