RandyB
RandyB

Reputation: 133

pd.read_csv with index_col= argument truncates leading zeros

SO has several answers to the question how to avoid truncating leading zeros with pd.read_csv, like this one.

My question is how to avoid truncation of leading zeros with the index_col= argument to the pd.read_csv method. In this example, the ID column has leading zeros.

>>> import pandas as pd
>>> miss = {'Amount' : [' ', 'NA']}
>>> url = "https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/messy_input.csv"

>>> d1 = pd.read_csv(url, skiprows=2, na_values=miss, dtype={'ID' : object})
>>> print(d1)
 ID       Date   Amount  Quantity   Status Unnamed: 5
0  0042  16-Oct-17  $23.99      123.0   Closed     Jansen
1  7731  15-Jan-17  $49.99        NaN  Pending        Rho
2  8843   9-Mar-17      129      45.0      NaN      Gupta
3  3013  12-Feb-17      NaN      15.0  Pending   Harrison
4  4431   1-Jul-17  $99.99        1.0   Closed       Yang
>>> print(d1.dtypes)
ID             object
Date           object
Amount         object
Quantity      float64
Status         object
Unnamed: 5     object
dtype: object

In a subsequent read with the index_col= argument, the index strips the leading zeros.

>>> miss = {'Amount' : [' ', 'NA']}
>>> url = "https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/messy_input.csv"
>>> d1 = pd.read_csv(url, skiprows=2, na_values=miss, converters={'ID' : 
str}, index_col='ID')
>>> print(d1)
       Date   Amount  Quantity   Status Unnamed: 5
ID
42    16-Oct-17  $23.99      123.0   Closed     Jansen
7731  15-Jan-17  $49.99        NaN  Pending        Rho
8843   9-Mar-17      129      45.0      NaN      Gupta
3013  12-Feb-17      NaN      15.0  Pending   Harrison
4431   1-Jul-17  $99.99        1.0   Closed       Yang
>>> print(d1.dtypes)
Date           object
Amount         object
Quantity      float64
Status         object
Unnamed: 5     object
dtype: object
>>> d1.index
Int64Index([42, 7731, 8843, 3013, 4431], dtype='int64', name='ID')

How do I keep the leading zeros, using just the pd.read_csv method? I know I can read without the index_col= argument and set the index after the read and obtain the desired results.

Upvotes: 0

Views: 713

Answers (1)

Dani G
Dani G

Reputation: 1242

Your only option would be to set the index after parsing.

d1 = pd.read_csv(url, skiprows=2, na_values=miss, converters={'ID' : 
str}).set_index('ID')

This has been an open issue in pandas for a while and still hasn't been fixed.

Upvotes: 1

Related Questions