Reputation: 8940
I am importing an Excel file into a pandas DataFrame with the pandas.read_excel() function.
One of the columns is the primary key of the table: it's all numbers, but it's stored as text (the little green triangle in the top left of the Excel cells confirms this).
However, when I import the file into a pandas dataframe, the column gets imported as a float. This means that, for example, '0614' becomes 614.
Is there a way to specify the datatype when importing a column? I understand this is possible when importing CSV files, but I couldn't find anything in the syntax of read_excel().
The only solution I can think of is to add an arbitrary letter at the beginning of the text (converting '0614' into 'A0614') in Excel, to make sure the column is imported as text, and then chopping off the 'A' in Python, so I can match it to other tables I am importing from SQL.
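For context, a minimal sketch of that workaround (the column name 'key' and file name are placeholders, not from my actual data) would look like this:
import pandas as pd

# Hypothetical sketch of the workaround described above:
# values were stored as 'A0614' in Excel, so drop the leading 'A'
# after import so the keys match the SQL tables again.
df = pd.read_excel('file.xlsx')
df['key'] = df['key'].astype(str).str[1:]   # 'A0614' -> '0614'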
Upvotes: 154
Views: 338930
Reputation: 1
I am surprised that none of the other answers have mentioned read_excel's N/A parsing. Certain strings ('', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'None', 'n/a', 'nan', 'null') get interpreted as NaN, even when dtype or converters are set. To get exactly the text in the Excel sheet, you can set na_filter=False.
df = pd.read_excel('file.xlsx', dtype={'col1':str, 'col2':str}, na_filter=False)
See the pandas read_excel documentation for details.
Upvotes: 0
Reputation: 25544
converters or dtype won't always help. Especially for date/time and duration values (ideally a mix of both...), post-processing is necessary. In such cases, reading the Excel file's content into built-in types and creating the DataFrame from those can be an option.
Here's an example file. The "duration" column contains duration values in HH:MM:SS format and invalid values ("-"). If the hour is less than 24, Excel formats the entry as a time, otherwise as a duration. We want dtype timedelta for the whole column in the DataFrame, but pandas messes up the import:
import pandas as pd
df = pd.read_excel("path-to-file")
df.duration
# 0 12:30:00
# 1 1900-01-01 00:30:00
# 2 -
# 3 1900-01-01 23:55:01
# Name: duration, dtype: object
[type(i) for i in df.duration]
# [datetime.time, datetime.datetime, str, datetime.datetime]
Now we have a mix of datetime.datetime and datetime.time objects, and it's difficult to get back to durations (timedelta)! You could do it directly with a converter, but that does not make it much less difficult.
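For completeness, a converter along these lines might work; this is only a sketch inferred from the datetime anchor visible in the output above (it is not part of the original answer), and it shows why the approach stays awkward:
import datetime
import pandas as pd

def cell_to_timedelta(value):
    # Hypothetical converter: normalize what read_excel hands over
    # (datetime.time for < 24 h, datetime.datetime for >= 24 h, str for "-").
    if isinstance(value, datetime.datetime):
        # the 1899-12-31 anchor is inferred from the output above; verify it for your file
        return pd.Timedelta(value - datetime.datetime(1899, 12, 31))
    if isinstance(value, datetime.time):
        return pd.Timedelta(hours=value.hour, minutes=value.minute,
                            seconds=value.second)
    return pd.NaT  # e.g. the invalid "-" entries

df = pd.read_excel("path-to-file", converters={"duration": cell_to_timedelta})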
Here, I found it actually easier to use the Excel loader engine directly:
from openpyxl import load_workbook
wb = load_workbook('path-to-file')
sheet = wb['Tests'] # adjust sheet name, this is for the demo file
data = list(sheet.values) # a list of tuples, one tuple for each row
df = pd.DataFrame(data[1:], columns=data[0]) # first tuple is column names
df['duration']
# 0 12:30:00
# 1 1 day, 0:30:00
# 2 -
# 3 1 day, 23:55:01
# Name: duration, dtype: object
[type(i) for i in df['duration']]
# [datetime.time, datetime.timedelta, str, datetime.timedelta]
So now we already have some timedelta objects! The conversion of the remaining values to timedelta can be done as simply as
df['duration'] = pd.to_timedelta(df.duration.astype(str), errors='coerce')
df['duration']
# 0 0 days 12:30:00
# 1 1 days 00:30:00
# 2 NaT
# 3 1 days 23:55:01
# Name: duration, dtype: timedelta64[ns]
Upvotes: 2
Reputation: 378
If you are able to read the Excel file correctly and only the integer values are not showing up as expected, you can specify the dtype like this:
df = pd.read_excel('my.xlsx', sheet_name='Sheet1', engine="openpyxl", dtype=str)
This should turn your integer values into strings and show them as such in the DataFrame.
Upvotes: 14
Reputation: 527
If you don't know the column names and you want to apply the str data type to all columns:
table = pd.read_excel("path_to_filename")
cols = table.columns
conv = dict(zip(cols, [str] * len(cols)))
table = pd.read_excel("path_to_filename", converters=conv)
Upvotes: 6
Reputation: 164613
If your key has a fixed number of digits, you should probably store it as text rather than as numeric data. You can use the converters argument of read_excel for this.
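For instance (a minimal sketch with placeholder file and column names), reading the key column through a str converter keeps the leading zeros:
import pandas as pd

# 'key' and 'file.xlsx' are placeholder names for this example.
# Reading the key column with a str converter preserves leading zeros.
df = pd.read_excel('file.xlsx', converters={'key': str})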
Or, if this does not work, just manipulate your data once it's read into your dataframe:
df['key_zfill'] = df['key'].astype(str).str.zfill(4)
  names   key key_zfill
0   abc     5      0005
1   def  4962      4962
2   ghi   300      0300
3   jkl    14      0014
4   mno    20      0020
Upvotes: 4
Reputation: 397
If you are not aware of the number and names of the columns in the DataFrame, this method can be handy:
column_list = []
df_column = pd.read_excel(file_name, 'Sheet1').columns
for i in df_column:
    column_list.append(i)
converter = {col: str for col in column_list}
df_actual = pd.read_excel(file_name, converters=converter)
where column_list is the list of your column names.
Upvotes: 13
Reputation: 29711
Starting with v0.20.0, the dtype keyword argument in the read_excel() function can be used to specify the data types to apply to the columns, just as it exists for the read_csv() case.
Using the converters and dtype arguments together on the same column name leads to the latter being shadowed: the converter takes preference.
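As a hedged illustration of that precedence (placeholder file and column names, not from the original answer):
import pandas as pd

# Hypothetical example: both dtype and converters target column 'a';
# per the precedence described above, the converter is the one applied.
df = pd.read_excel('file_name.xlsx',
                   dtype={'a': str},
                   converters={'a': lambda v: str(v).zfill(4)})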
1) In order for it not to interpret the dtypes but rather pass all of the contents of its columns as they originally were in the file, we can set this argument to str or object so that we don't mess up our data. (One such case would be leading zeros in numbers, which would otherwise be lost.)
pd.read_excel('file_name.xlsx', dtype=str) # (or) dtype=object
2) It even supports a dict mapping wherein the keys are the column names and the values are their respective data types, which is useful when you want to alter the dtype for only a subset of all the columns.
# Assuming data types for `a` and `b` columns are to be altered
import numpy as np

pd.read_excel('file_name.xlsx', dtype={'a': np.float64, 'b': np.int32})
Upvotes: 114
Reputation: 6253
You just specify converters. I created an excel spreadsheet of the following structure:
names  ages
bob    05
tom    4
suzy   3
Where the "ages" column is formatted as strings. To load:
import pandas as pd
df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', header=0, converters={'names': str, 'ages': str})
>>> df
  names ages
0   bob   05
1   tom    4
2  suzy    3
Upvotes: 202
Reputation: 787
The read_excel() function has a converters argument, where you can apply functions to the input in certain columns. You can use this to keep them as strings. From the documentation:
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
Example code:
pandas.read_excel(my_file, converters={my_str_column: str})
Upvotes: 10