Reputation: 1270
I obtained an Excel file with complicated formatting for some cells. Here is a sample:
The "USDC Amount USDC" column has formatting of "General" for the header cell, and the following for cells C2 through C6:
I need to read this column into pandas as a float value. However, when I use
import pandas
df = pandas.read_excel('Book1.xlsx')
print(['USDC Amount USDC'])
print(df['USDC Amount USDC'])
I get
['USDC Amount USDC']
0 NaT
1 1927-06-05 05:38:32.726400
2 1872-07-25 18:21:27.273600
3 NaT
4 NaT
Name: USDC Amount USDC, dtype: datetime64[ns]
I do not want these as datetimes, I want them as floats! If I remove the complicated formatting in the Excel document (change it to "general" in column C), they are read in as float values, like this, which is what I want:
['USDC Amount USDC']
0 NaN
1 10018.235101
2 -10018.235101
3 NaN
4 NaN
Name: USDC Amount USDC, dtype: float64
The problem is that I have to download these Excel documents on a regular basis, and cannot modify them from the source. I have to get Pandas to understand (or ignore) this formatting and interpret the value as a float on its own.
I'm on Pandas 1.4.4, Windows 10, and Python 3.8. Any idea how to fix this? I cannot change the source Excel file, all the processing must be done in the Python script.
EDIT:
I added the sample Excel document in my comment below to download for reference. Also, here are some other package versions in case these matter:
openpyxl==3.0.3
xlrd==1.2.0
XlsxWriter==1.2.8
Upvotes: 0
Views: 653
Reputation: 1270
It appears updating OpenPyXL from 3.0.3
to 3.1.0
resolved this issue. A quick glance at the changelog (https://openpyxl.readthedocs.io/en/stable/changes.html) suggests it appears to be related to bugfix 1413 or 1500.
Upvotes: 1
Reputation: 856
You could use the dtype input in read_excel to be along the lines of
import numpy as np
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC':np.float64})
but that comes with some issues. Particularly, your source data contains characters that can't be casted into a float. Your next best options are the object or string dtypes. So instead of :np.float64, you would do something like :"string" instead, resulting in
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC':"string"})
After that, you need to extract the numbers from the column. Here's a resource that could help you get an idea of the overall process, although the exact method of doing so is up to you.
Finally, you would want to convert the now numbers-only column to floats. You can do it with the inbuilt casting which is
df["numbers_only"] = df["numbers_only"].astype(np.float64)
Upvotes: 0