n1000
n1000

Reputation: 5314

Harmonise pandas MultiIndex to string when reading Excel files

I am reading data from an Excel file into a pandas DataFrame using read_excel(). Unfortunately, it seems difficult to ensure the formatting of cells in Excel and so it happens that a table like this:

       2018                2019
          a         b         a         b
0  1.295666 -0.544973  0.845973 -0.874668
1  0.590123  0.284364 -1.482706 -0.859350
2  0.832228  0.469992  0.994865  0.480301
3  0.098671  0.198643  0.878323 -0.119761

...actually has surprising indices or columns:

df.columns

MultiIndex(levels=[[2018, 2019, '2019'], ['a', 'b']],
           labels=[[0, 0, 1, 2], [0, 1, 0, 1]])

As you can see, the primary index of the last column actually has a string for 2019 and not an integer like the others.

To be on the safe side, I would like to convert all indices to string, but pandas wont let me:

df.columns.set_levels(df.columns.levels[0].astype(str), level=0)

ValueError: Level values must be unique: ['2018', '2019', '2019'] on level 0

I see two approaches to solve this:

  1. Have read_excel() convert column headers to string or
  2. Get set_levels() as in my example above to work.

But I can't get either to work - any hints?

Upvotes: 0

Views: 184

Answers (1)

BENY
BENY

Reputation: 323326

You can re-create your multiple index for columns

idx=pd.MultiIndex.from_product([df.columns.levels[0].astype(int).unique(),df.columns.levels[1]])
df.columns=idx
df.columns
MultiIndex(levels=[[2018, 2019], ['a', 'b']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

From op better layout format

df.columns = pd.MultiIndex.from_product([c.astype(str).unique() for c in df.columns.levels])

Update / Caveat

This solution can lead to some headache. data.columns.codes (formally know as data.columns.labels) does not necessarily come in an increasing order from read_excel(), e.g. FrozenList([[3, 3, 2, 2, 1, 1, 0, 0], [1, 0, 1, 0, 1, 0, 1, 0]]) can occur. When using the .from_product() approach here, this will cause trouble and change the order of the column names... A workaround is to save its state and write it back after the deed:

old_col_codes = df.columns.codes
df.columns = pd.MultiIndex.from_product([c.astype(str).unique() for c in df.columns.levels])
df.columns.set_codes(old_df_codes, inplace=True)

Upvotes: 1

Related Questions