Reputation: 667
I have a pandas DataFrame (df) with plenty of columns and rows, many of which contain integers.
My intention is to save the DataFrame as an Excel file and read it back again while retaining the integrity of the data.
I'm using the following steps:
import pandas as pd

writer = pd.ExcelWriter("myExcelFile.xlsx")
df.to_excel(writer, 'sheet_name')
writer.save()
import glob
files = glob.glob("myExcelFile*.xlsx") #gives list of files
myFile = files[0]
df = pd.read_excel(myFile , sheetname='sheet_name', convert_float=True)
Please note the option convert_float. Excel stores all numbers as floats, so this option is supposed to convert float values back to integers where possible, e.g. 1.0 -> 1.
My requirement is to get back the original integer values that I saved to this Excel sheet. However, this doesn't work for some reason. Am I going wrong somewhere?
Is there a way I can handle this while saving to Excel?
I've tried to work around this by converting the integers to strings, storing the strings in Excel, reading the strings back, and reconverting them to integers. But the pain is too severe, both for me and my app :-/
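Roughly, that workaround looked like this (the column name here is just a placeholder):

# Hypothetical sketch of the string workaround; 'my_col' stands in for each integer column.
df['my_col'] = df['my_col'].astype(str)   # integers -> strings before writing
df.to_excel(writer, 'sheet_name')
writer.save()
# ... later, after reading the file back with pd.read_excel ...
df['my_col'] = df['my_col'].astype(int)   # strings -> integers again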
Upvotes: 1
Views: 6411
Reputation: 20791
I can't replicate your problem. It seems to work fine for me:
import pandas as pd

df = pd.DataFrame({'Floats':   [10.1, 20.2, 30.3, 20.0, 15.9, 30.1, 45.0],
                   'Integers': [10.0, 20.0, 30, 20, 15, 30, 45]})

filename = 'df.xlsx'
writer = pd.ExcelWriter(filename)
df.to_excel(writer)
writer.save()

df = pd.read_excel(filename, convert_float=True)
print df
Result:
   Floats  Integers
0    10.1        10
1    20.2        20
2    30.3        30
3    20.0        20
4    15.9        15
5    30.1        30
6    45.0        45
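If you want to double-check that the integer column really came back as integers, you can print the dtypes (a quick sketch):

print df.dtypes
# Floats should come back as float64 and Integers as int64,
# since convert_float turns the integral floats back into ints.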
Do you get the same result when you run this code? If so, then there must be something else going on. Can you give us code that demonstrates the problem?
Note that any column containing at least one float will be treated as floats for the whole column, because a column can't normally hold multiple datatypes (see below regarding the object column type).
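As a quick illustration of that promotion (a minimal sketch, not from the original data):

import pandas as pd

# A single non-integral value is enough to promote the whole column to float64.
df = pd.DataFrame({'A': [1, 2, 3],        # all ints -> int64
                   'B': [1, 2, 3.5]})     # one float -> float64
print df.dtypes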
One workaround if the above code doesn't work for some reason would be to force certain columns and/or the index to be integers manually, like this:
df = pd.read_excel(filename)  # cast explicitly rather than relying on convert_float
df['Integers'] = df['Integers'].astype(int)
df.index = df.index.astype(int)
print df
And you could force all columns to be integers like this (this assumes every column holds numeric values; a column with strings or missing values will raise an error):
df = pd.read_excel(filename).astype(int)
Edit after OP gave more detail:
If you know which columns need to be treated as strings, you can use the same manual technique from above:
df['Strings'] = df['Strings'].astype(str)
But you want it to be more automatic. This is hacky, but it works: if you append a dummy value to the end of your data that is blatantly a string, like 'dummy', then pandas will bring the column in as object dtype, with each element keeping its own datatype. Without the dummy string, it doesn't work; you can try the commented-out dataframe in my code to see.
import pandas as pd

# This works.
df = pd.DataFrame({'Floats':  [10.1, 20.2, 30.3, 20.0, 15.9, 30.1, 0],
                   'Objects': ['10.0', 20.0, 30.5, 20, 15, 30, 'dummy']})

# This doesn't work.
# df = pd.DataFrame({'Floats':  [10.1, 20.2, 30.3, 20.0, 15.9, 30.1],
#                    'Objects': ['10.0', 20.0, 30.5, 20, 15, 30]})

filename = 'df.xlsx'
writer = pd.ExcelWriter(filename)
df.to_excel(writer)
writer.save()
# Remove the dummy row.
df = pd.read_excel(filename)[:-1]
print df
print
print df.dtypes
print
print df.loc[0, 'Objects'], type(df.loc[0, 'Objects'])
print df.loc[1, 'Objects'], type(df.loc[1, 'Objects'])
print df.loc[2, 'Objects'], type(df.loc[2, 'Objects'])
print df.loc[3, 'Objects'], type(df.loc[3, 'Objects'])
Result:
   Floats Objects
0    10.1    10.0
1    20.2      20
2    30.3    30.5
3    20.0      20
4    15.9      15
5    30.1      30

Floats     float64
Objects     object
dtype: object

10.0 <type 'unicode'>
20 <type 'int'>
30.5 <type 'float'>
20 <type 'int'>
Upvotes: 1