Mad Physicist

Reputation: 114240

Specify converter for Pandas index column in read_csv

I am attempting to read in a CSV file with hexadecimal numbers in the index column:

InputBits, V0, V1, V2, V3
7A, 0.000594457716, 0.000620631282, 0.000569834178, 0.000625374384, 
7B, 0.000601155649, 0.000624282078, 0.000575955914, 0.000632111367, 
7C, 0.000606026872, 0.000629149805, 0.000582689823, 0.000634561234, 
7D, 0.000612115902, 0.000634625998, 0.000584526357, 0.000638235952, 
7E, 0.000615769413, 0.000637668328, 0.000590648093, 0.00064987256, 
7F, 0.000620640637, 0.000643144494, 0.000594933308, 0.000650485013, 

I can do it using the following code:

df = pd.read_csv('data.csv', index_col=False,
                 converters={'InputBits': lambda x: int(x, 16)})
df.set_index('InputBits', inplace=True)

The problem is that this seems unnecessarily clunky. Is there a way to do something equivalent to the following?

df = pd.read_csv('data.csv', converters={'InputBits': lambda x: int(x, 16)})

This fails because the InputBits label now points at the first data column, so the converter receives a float string:

ValueError: invalid literal for int() with base 16: ' 0.000594457716'

Upvotes: 5

Views: 5211

Answers (1)

Mad Physicist

Reputation: 114240

As @root pointed out here, the issue in this example is that the header and the data rows are misaligned: every data row ends with an extra trailing comma, so it has one more field than the header. The documentation deals with this specific scenario:

If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names)

The solution here was first to run

sed -i 's/, \r$//' data.csv

to get rid of the final commas (and Windows line endings). Then, the expected command works almost out of the box:

pd.read_csv('data.csv', index_col='InputBits',
            converters={'InputBits': lambda x: int(x, 16)})

Upvotes: 2
