Reputation: 693
UPDATE:
I have found the source of the error: in the current version of pandas, DataFrame columns with the 'object' dtype no longer use scientific notation. Large values are displayed with the right significant figures, but small numbers are displayed as 0.0.
If you access the cell from the running script you still get the correct value. The issue is that if you store the dataframe, as a text file for example, you save the incorrect value.
This is a code example with the correct (for me) behaviour in previous versions:
import pandas as pd
print(f'pandas version {pd.__version__}')
idx = 'H1_6563A'
# Raw string so that \A and \, are not treated as escape sequences
data = {'ion': 'H1',
        'wavelength': 6563.0,
        'latex_label': r'$6563\AA\,HI$',
        'intgr_flux': 3.128572e-14,
        'dist': 2.8e20,
        'eqw': 1464.05371}
mySeries = pd.Series(index=data.keys(), dtype='object')
for param, value in data.items():
    mySeries[param] = value
print(f'\nSeries: \n {mySeries}')
myDF = pd.DataFrame(columns=data.keys())
myDF.loc[idx] = mySeries
print(f'\nDataFrame:\n {myDF}')
Where the dataframe shows a combination of scientific and non-scientific floats:
pandas version 1.2.3
Series:
ion H1
wavelength 6563.0
latex_label $6563\AA\,HI$
intgr_flux 0.0
dist 280000000000000000000.0
eqw 1464.05371
dtype: object
DataFrame:
ion wavelength latex_label intgr_flux dist eqw
H1_6563A H1 6563.0 $6563\AA\,HI$ 3.128572e-14 2.800000e+20 1464.05371
The same script in pandas 1.4.1 returns:
pandas version 1.4.1
Series:
ion H1
wavelength 6563.0
latex_label $6563\AA\,HI$
intgr_flux 0.0
dist 280000000000000000000.0
eqw 1464.05371
dtype: object
DataFrame:
ion wavelength latex_label intgr_flux dist eqw
H1_6563A H1 6563.0 $6563\AA\,HI$ 0.0 280000000000000000000.0 1464.05371
I wonder if anyone would please share their approaches to replicate the original behaviour so I can have a dataframe with mixed variables (strings, ints, floats, None, scientific, non-scientific) and show the correct significant figures.
Thank you very much.
ORIGINAL QUESTION
I am using a pandas.Series as a container for entries of different types. I have noticed the following issue while declaring small floats in scientific notation:
import numpy as np
import pandas as pd
print(f'Pandas {pd.__version__}')
columns = ['c0', 'c1', 'c2', 'c3']
mySeries = pd.Series(index=columns)
mySeries['c0'] = 'None'
mySeries['c1'] = np.nan
mySeries['c2'] = 1234.0
mySeries['c3'] = 1.234e-18
print(mySeries)
which returns:
c0 None
c1 NaN
c2 1234.0
c3 0.0
dtype: object
Accessing the 'c3' entry returns the complete float. However, if you convert this series to a pandas.DataFrame and save it to a text file (using the .to_string() method), it will be stored as 0.0.
If your first entry is a float this does not happen:
columns = ['c0', 'c1', 'c2', 'c3']
mySeries = pd.Series(index=columns)
mySeries['c0'] = 123
mySeries['c1'] = np.nan
mySeries['c2'] = 1234.0
mySeries['c3'] = 1.234e-18
print(mySeries)
c0 1.230000e+02
c1 NaN
c2 1.234000e+03
c3 1.234000e-18
dtype: float64
So my question is: what is the right way to declare the dtype so that the order of the entries does not affect the display? Moreover, I wonder if anyone knows which parameter decides whether a cell uses scientific notation.
Thanks a lot.
Upvotes: 1
Views: 693
Reputation: 3967
I would shape my df first, with proper dtypes, then add the data:
import pandas as pd
df = pd.DataFrame(
    {'ion': pd.Series(dtype='str'),
     'wavelength': pd.Series(dtype='float'),
     'intgr_flux': pd.Series(dtype='float')})
idx = 'H1_6563A'
data = {
    'ion': 'H1',
    'wavelength': 6563.0,
    'intgr_flux': 3.128572e-14}
df.loc[idx] = data
print(df)
# Outputs:
# ion wavelength intgr_flux
# H1_6563A H1 6563.0 3.128572e-14
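If the DataFrame already exists with every column stored as 'object' (as in the question), one possible way back to per-column dtypes is DataFrame.infer_objects(). This is only a sketch under an assumption: it rebuilds a one-row frame similar to the question's with dtype='object' (the names obj_df and typed_df are made up for illustration) and then lets pandas infer the column types.

```python
import pandas as pd

idx = 'H1_6563A'
data = {'ion': 'H1',
        'wavelength': 6563.0,
        'intgr_flux': 3.128572e-14,
        'dist': 2.8e20}

# Assumed starting point: a frame whose columns are all 'object'
obj_df = pd.DataFrame([data], index=[idx], dtype='object')

# Soft-convert object columns that hold only numbers to a numeric dtype
typed_df = obj_df.infer_objects()

print(typed_df.dtypes)
print(typed_df.to_string())
```

Once the numeric columns are float64 again, each column is formatted on its own, so small values should no longer collapse to 0.0 in the text output.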
Upvotes: 1
Reputation: 1264
You should specify the dtype when you create the series, e.g.
mySeries = pd.Series(index=columns, dtype='float64')
If you don't, it will be inferred, probably from the first value you set.
See https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas-series for more details.
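As a rough illustration of the point above, the sketch below declares the dtype up front and fills the entries in a different order. It assumes all entries are numeric (np.nan stands in for a missing value instead of the string 'None'), and the float_format passed to to_string() is just one possible choice.

```python
import numpy as np
import pandas as pd

columns = ['c1', 'c2', 'c3']

# Declaring the dtype keeps the Series numeric regardless of assignment order
s = pd.Series(index=columns, dtype='float64')
s['c3'] = 1.234e-18
s['c1'] = np.nan
s['c2'] = 1234.0

print(s)

# float_format takes a one-argument callable applied to each float
text = s.to_string(float_format='{:.3e}'.format)
print(text)
```

With a float64 Series, the float_format argument gives explicit control over the notation used when the values are written out as text.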
UPDATE
It looks like when a Python float is stored with dtype 'object', the string representation doesn't use scientific notation and truncates to about six digits after the decimal point. That might be a bug, depending on your point of view.
The difference in behavior you observed between your two series, though, comes from the dtype: your second series contains only numbers, so the dtype is 'float64'. Your first series contains 'None', which is a string, so the dtype is 'object'. If you must store both strings and numbers in the same series, I would use the 'string' dtype and store everything as strings.
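A minimal sketch of the "store everything as strings" idea, applied only at the moment of writing the text output. The helper cell_to_text is hypothetical (not part of pandas); it uses Python's repr() for floats so that small and large values keep their significant figures, and str() for everything else.

```python
import pandas as pd


def cell_to_text(value):
    """Hypothetical helper: render floats with repr(), everything else with str()."""
    if isinstance(value, float):
        # float() guards against NumPy scalars, whose repr may differ
        return repr(float(value))
    return str(value)


mixed = pd.Series({'ion': 'H1',
                   'wavelength': 6563.0,
                   'intgr_flux': 3.128572e-14,
                   'dist': 2.8e20},
                  dtype='object')

# Convert each cell to text explicitly instead of relying on the display rules
as_text = mixed.map(cell_to_text)
text = as_text.to_string()
print(text)
```

The original Series keeps its numeric values, so calculations still work; only the copy that is written out contains strings.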
Upvotes: 1