JC_CL
JC_CL

Reputation: 2608

How to replace a string in a pandas multiindex?

I have a dataframe with a large multiindex, sourced from a vast number of csv files. Some of those files have errors in the various labels, ie. "window" is missspelled as "winZZw", which then causes problems when I select all windows with df.xs('window', level='middle', axis=1).

So I need a way to simply replace winZZw with window.

Here's a very minimal sample df: (lets assume the data and the 'roof', 'window'… strings come from some convoluted text reader)

header = pd.MultiIndex.from_product(['roof', 'window', 'basement'], names = ['top', 'middle', 'bottom'])
dates = pd.date_range('01/01/2000','01/12/2010', freq='MS')
data = np.random.randn(len(dates))
df = pd.DataFrame(data, index=dates, columns=header)
header2 = pd.MultiIndex.from_product(['roof', 'winZZw', 'basement'], names = ['top', 'middle', 'bottom'])
data = 3*(np.random.randn(len(dates)))
df2 = pd.DataFrame(data, index=dates, columns=header2)
df = pd.concat([df, df2], axis=1)
header3 = pd.MultiIndex.from_product(['roof', 'door', 'basement'], names = ['top', 'middle', 'bottom'])
data = 2*(np.random.randn(len(dates)))
df3 = pd.DataFrame(data, index=dates, columns=header3)
df = pd.concat([df, df3], axis=1)

Now I want to xs a new dataframe for all the houses that have a window at their middle level: windf = df.xs('window', level='middle', axis=1)

But this obviously misses the misspelled winZZw.

So, how I replace winZZw with window?

The only way I found was to use set_levels, but if I understood that correctly, I need to feed it the whole level, ie

df.columns.set_levels([u'window',u'window', u'door'], level='middle',inplace=True)

but this has two issues:

I can work around these issues by xsing a new df of only winZZws, and then setting the levels with set_levels(df.shape[1]*[u'window'], level='middle') and then concatting it together again, but I'd like to have something more straightforward analog to str.replace('winZZw', 'window'), but I can't figure out how.

Upvotes: 2

Views: 2295

Answers (2)

Clement H.
Clement H.

Reputation: 1428

A more general solution to replace a string within a multiindex is the following

df.columns = pd.MultiIndex.from_tuples([tuple([x.replace("to_replace", "new_str") for x in tuple_index]) for tuple_index in df.columns])

Upvotes: 0

jezrael
jezrael

Reputation: 862406

Use rename with specifying level:

header = pd.MultiIndex.from_product([['roof'],[ 'window'], ['basement']], names = ['top', 'middle', 'bottom'])
dates = pd.date_range('01/01/2000','01/12/2010', freq='MS')
data = np.random.randn(len(dates))
df = pd.DataFrame(data, index=dates, columns=header)
header2 = pd.MultiIndex.from_product([['roof'], ['winZZw'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 3*(np.random.randn(len(dates)))
df2 = pd.DataFrame(data, index=dates, columns=header2)
df = pd.concat([df, df2], axis=1)
header3 = pd.MultiIndex.from_product([['roof'], ['door'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 2*(np.random.randn(len(dates)))
df3 = pd.DataFrame(data, index=dates, columns=header3)
df = pd.concat([df, df3], axis=1)

df = df.rename(columns={'winZZw':'window'}, level='middle')
print(df.head())

top             roof                    
middle        window                door
bottom      basement  basement  basement
2000-01-01 -0.131052 -1.189049  1.310137
2000-02-01 -0.200646  1.893930  2.124765
2000-03-01 -1.690123 -2.128965  1.639439
2000-04-01 -0.794418  0.605021 -2.810978
2000-05-01  1.528002 -0.286614  0.736445

Upvotes: 2

Related Questions