Reputation: 8940
I am importing an Excel file into a pandas DataFrame with the pandas.read_excel() function.
One of the columns is the primary key of the table: it's all numbers, but it's stored as text (the little green triangle in the top left of the Excel cells confirms this).
However, when I import the file into a pandas dataframe, the column gets imported as a float. This means that, for example, '0614' becomes 614.
Is there a way to specify the datatype when importing a column? I understand this is possible when importing CSV files, but I couldn't find anything in the syntax of read_excel().
The only solution I can think of is to add an arbitrary letter at the beginning of the text (converting '0614' into 'A0614') in Excel, to make sure the column is imported as text, and then chopping off the 'A' in Python, so I can match it to other tables I am importing from SQL.
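For context, a minimal sketch of that workaround (the column name 'key' and file name are placeholders, not from my actual data) would look like this:
import pandas as pd

# Hypothetical sketch of the workaround described above:
# values were stored as 'A0614' in Excel, so drop the leading 'A'
# after import so the keys match the SQL tables again.
df = pd.read_excel('file.xlsx')
df['key'] = df['key'].astype(str).str[1:]   # 'A0614' -> '0614'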
Upvotes: 154
Views: 338930
Reputation: 1
I am surprised that none of the other answers have mentioned read_excel's N/A parsing. Certain strings ('', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'None', 'n/a', 'nan', 'null') get interpreted as NaN, even when dtype or converters are set. To get exactly the text in the Excel sheet, you can set na_filter=False.
df = pd.read_excel('file.xlsx', dtype={'col1':str, 'col2':str}, na_filter=False)
See the pandas read_excel documentation for details.
Upvotes: 0
Reputation: 25544
converters or dtype won't always help. Especially for date/time and duration values (ideally a mix of both...), post-processing is necessary. In such cases, reading the Excel file's content into built-in types and creating the DataFrame from those can be an option.
Here's an example file. The "duration" column contains duration values in HH:MM:SS format and invalid values ("-"). If the hour is less than 24, Excel formats the entry as a time, otherwise as a duration. We want dtype timedelta for the whole column in the DataFrame, but pandas messes up the import:
import pandas as pd
df = pd.read_excel("path-to-file")
df.duration
# 0 12:30:00
# 1 1900-01-01 00:30:00
# 2 -
# 3 1900-01-01 23:55:01
# Name: duration, dtype: object
[type(i) for i in df.duration]
# [datetime.time, datetime.datetime, str, datetime.datetime]
Now we have a mix of datetime.datetime and datetime.time objects, and it's difficult to get back to durations (timedelta)! You could do it directly with a converter, but that does not make it much less difficult.
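For completeness, a converter along these lines might work; this is only a sketch inferred from the datetime anchor visible in the output above (it is not part of the original answer), and it shows why the approach stays awkward:
import datetime
import pandas as pd

def cell_to_timedelta(value):
    # Hypothetical converter: normalize what read_excel hands over
    # (datetime.time for < 24 h, datetime.datetime for >= 24 h, str for "-").
    if isinstance(value, datetime.datetime):
        # the 1899-12-31 anchor is inferred from the output above; verify it for your file
        return pd.Timedelta(value - datetime.datetime(1899, 12, 31))
    if isinstance(value, datetime.time):
        return pd.Timedelta(hours=value.hour, minutes=value.minute,
                            seconds=value.second)
    return pd.NaT  # e.g. the invalid "-" entries

df = pd.read_excel("path-to-file", converters={"duration": cell_to_timedelta})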
Here, I found it actually easier to use the Excel loader engine directly:
from openpyxl import load_workbook
wb = load_workbook('path-to-file')
sheet = wb['Tests'] # adjust sheet name, this is for the demo file
data = list(sheet.values) # a list of tuples, one tuple for each row
df = pd.DataFrame(data[1:], columns=data[0]) # first tuple is column names
df['duration']
# 0 12:30:00
# 1 1 day, 0:30:00
# 2 -
# 3 1 day, 23:55:01
# Name: duration, dtype: object
[type(i) for i in df['duration']]
# [datetime.time, datetime.timedelta, str, datetime.timedelta]
So now we already have some timedelta objects! The conversion of the remaining values to timedelta can be done as simply as
df['duration'] = pd.to_timedelta(df.duration.astype(str), errors='coerce')
df['duration']
# 0 0 days 12:30:00
# 1 1 days 00:30:00
# 2 NaT
# 3 1 days 23:55:01
# Name: duration, dtype: timedelta64[ns]
Upvotes: 2
Reputation: 378
If you are able to read the Excel file correctly and only the integer values are not showing up as expected, you can specify the dtype like this:
df = pd.read_excel('my.xlsx', sheet_name='Sheet1', engine="openpyxl", dtype=str)
This should turn your integer values into strings and show them as such in the DataFrame.
Upvotes: 14
Reputation: 527
If you don't know the column names and you want to apply the str data type to all columns:
table = pd.read_excel("path_to_filename")
cols = table.columns
conv = dict(zip(cols, [str] * len(cols)))
table = pd.read_excel("path_to_filename", converters=conv)
Upvotes: 6
Reputation: 164613
If your key has a fixed number of digits, you should probably store it as text rather than as numeric data. You can use the converters argument of read_excel for this.
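For instance (a minimal sketch with placeholder file and column names), reading the key column through a str converter keeps the leading zeros:
import pandas as pd

# 'key' and 'file.xlsx' are placeholder names for this example.
# Reading the key column with a str converter preserves leading zeros.
df = pd.read_excel('file.xlsx', converters={'key': str})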
Or, if this does not work, just manipulate your data once it's read into your dataframe:
df['key_zfill'] = df['key'].astype(str).str.zfill(4)
  names   key key_zfill
0   abc     5      0005
1   def  4962      4962
2   ghi   300      0300
3   jkl    14      0014
4   mno    20      0020
Upvotes: 4
Reputation: 397
If you are not aware of the number and names of the columns in the DataFrame, this method can be handy:
column_list = []
df_column = pd.read_excel(file_name, 'Sheet1').columns
for i in df_column:
    column_list.append(i)
converter = {col: str for col in column_list}
df_actual = pd.read_excel(file_name, converters=converter)
where column_list is the list of your column names.
Upvotes: 13
Reputation: 29711
Starting with v0.20.0, the dtype keyword argument in the read_excel() function can be used to specify the data types to apply to the columns, just as it exists for the read_csv() case.
Using the converters and dtype arguments together on the same column name leads to the latter being shadowed: the converter takes preference.
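As a hedged illustration of that precedence (placeholder file and column names, not from the original answer):
import pandas as pd

# Hypothetical example: both dtype and converters target column 'a';
# per the precedence described above, the converter is the one applied.
df = pd.read_excel('file_name.xlsx',
                   dtype={'a': str},
                   converters={'a': lambda v: str(v).zfill(4)})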
1) In order for it not to interpret the dtypes but rather pass all of the contents of its columns as they originally were in the file, we can set this argument to str or object so that we don't mess up our data. (One such case would be leading zeros in numbers, which would otherwise be lost.)
pd.read_excel('file_name.xlsx', dtype=str) # (or) dtype=object
2) It even supports a dict mapping wherein the keys are the column names and the values are their respective data types, which is useful when you want to alter the dtype for only a subset of all the columns.
# Assuming data types for `a` and `b` columns are to be altered
import numpy as np

pd.read_excel('file_name.xlsx', dtype={'a': np.float64, 'b': np.int32})
Upvotes: 114
Reputation: 6253
You just specify converters. I created an excel spreadsheet of the following structure:
names  ages
bob    05
tom    4
suzy   3
Where the "ages" column is formatted as strings. To load:
import pandas as pd
df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', header=0, converters={'names': str, 'ages': str})
>>> df
  names ages
0   bob   05
1   tom    4
2  suzy    3
Upvotes: 202
Reputation: 787
The read_excel() function has a converters argument, where you can apply functions to the input in certain columns. You can use this to keep them as strings. From the documentation:
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
Example code:
pandas.read_excel(my_file, converters={my_str_column: str})
Upvotes: 10