Reputation: 133
SO has several answers to the question how to avoid truncating leading zeros with pd.read_csv, like this one.
My question is how to avoid truncation of leading zeros with the index_col= argument to the pd.read_csv method. In this example, the ID column has leading zeros.
>>> import pandas as pd
>>> miss = {'Amount' : [' ', 'NA']}
>>> url = "https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/messy_input.csv"
>>> d1 = pd.read_csv(url, skiprows=2, na_values=miss, dtype={'ID' : object})
>>> print(d1)
ID Date Amount Quantity Status Unnamed: 5
0 0042 16-Oct-17 $23.99 123.0 Closed Jansen
1 7731 15-Jan-17 $49.99 NaN Pending Rho
2 8843 9-Mar-17 129 45.0 NaN Gupta
3 3013 12-Feb-17 NaN 15.0 Pending Harrison
4 4431 1-Jul-17 $99.99 1.0 Closed Yang
>>> print(d1.dtypes)
ID object
Date object
Amount object
Quantity float64
Status object
Unnamed: 5 object
dtype: object
In a subsequent read with the index_col= argument, the index strips the leading zeros.
>>> miss = {'Amount' : [' ', 'NA']}
>>> url = "https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/messy_input.csv"
>>> d1 = pd.read_csv(url, skiprows=2, na_values=miss, converters={'ID' :
str}, index_col='ID')
>>> print(d1)
Date Amount Quantity Status Unnamed: 5
ID
42 16-Oct-17 $23.99 123.0 Closed Jansen
7731 15-Jan-17 $49.99 NaN Pending Rho
8843 9-Mar-17 129 45.0 NaN Gupta
3013 12-Feb-17 NaN 15.0 Pending Harrison
4431 1-Jul-17 $99.99 1.0 Closed Yang
>>> print(d1.dtypes)
Date object
Amount object
Quantity float64
Status object
Unnamed: 5 object
dtype: object
>>> d1.index
Int64Index([42, 7731, 8843, 3013, 4431], dtype='int64', name='ID')
How do I keep the leading zeros, using just the pd.read_csv method? I know I can read without the index_col= argument and set the index after the read and obtain the desired results.
Upvotes: 0
Views: 713
Reputation: 1242
Your only option would be to set the index after parsing.
d1 = pd.read_csv(url, skiprows=2, na_values=miss, converters={'ID' :
str}).set_index('ID')
This has been an open issue in pandas for a while and still hasn't been fixed.
Upvotes: 1