Andi
Andi

Reputation: 4899

How to replace an individual level in a multi-level column index in pandas

Consider the following multi-level column index dataframe:

import numpy as np
import pandas as pd

arrays = [
    ["A", "A", "B", "B"],
    ["one", "two", "one", "two"],
    ["1", "2", "1", "pd.NA"],
]
idx = pd.MultiIndex.from_arrays(arrays, names=["level_0", "level_1", "level_2"])
data = np.random.randn(3, 4)
df = pd.DataFrame(data, columns=idx)
print(df)

level_0         A                   B          
level_1       one       two       one       two
level_2         1         2         1     pd.NA
0       -1.249285  0.314225  0.011139  0.675274
1       -0.654808 -0.492350  0.596338 -0.087334
2        0.113570  0.566687 -0.361334  0.085368

level_2 holds values of type object (str really).

df.columns.get_level_values(2)
Index(['1', '2', '1', 'pd.NA'], dtype='object', name='level_2')

I need to parse it to the correct data type and change this particular column level.

new_level_2 = [
    pd.NA if x == "pd.NA" else int(x) for x in df.columns.get_level_values(2)
]

I am looking for a pythonic way to replace the old level_2 with new_level_2.

Upvotes: 4

Views: 108

Answers (4)

rhug123
rhug123

Reputation: 8778

One option is to use rename()

df.rename(lambda x: np.NaN if x=='pd.NA' else int(x), level=2, axis=1)

Output:

          A                   B          
        one       two       one       two
          1         2         1       NaN
0 -0.270906 -1.650086 -0.919673  0.878767
1  0.557711  0.797267 -0.433733  0.516062
2  3.799616  0.333024 -0.159180  0.003272

Upvotes: 0

ouroboros1
ouroboros1

Reputation: 14414

You can use MultiIndex.set_levels, MultiIndex.levels and Index.where. Chain Index.astype for proper dtype conversion:

df.columns = (
    df.columns.set_levels(
        df.columns.levels[2].where(
            df.columns.levels[2] != 'pd.NA', 
            pd.NA
        ).astype('Int64'), 
        level=2
    )
)

Or using pd.to_numeric:

df.columns = (
    df.columns.set_levels(
        pd.to_numeric(df.columns.levels[2], errors='coerce')
        .astype('Int64'),
        level=2
    )
)

Output:

level_0         A                   B          
level_1       one       two       one       two
level_2      1         2         1         <NA>
0        1.764052  0.400157  0.978738  2.240893
1        1.867558 -0.977278  0.950088 -0.151357
2       -0.103219  0.410599  0.144044  1.454274

Performance comparison

If performance is a concern, a comparison seems to suggest:

  • Use list comprehension for a small MultiIndex.
  • Use Index.where for a big one.

Setup data

import pandas as pd
import numpy as np

def create_columns(N):

    np.random.seed(0)
    
    array = np.arange(N).astype(str)
    num_nans = int(0.2 * N) # 20% 'pd.NA'
    nan_indices = np.random.choice(N, num_nans, replace=False)
    array[nan_indices] = 'pd.NA'
    
    columns = pd.MultiIndex.from_product([
        ["A", "B"],
        ["one", "two"],
        array
        ], names=["level_0", "level_1", "level_2"])
    return columns

Functions

def replace(columns):
    return columns.from_frame(columns.to_frame()
                                      .replace({'level_2': {'pd.NA': pd.NA}})
                                      .astype({'level_2': 'Int64'}))

def list_comprehension(columns):
    return columns.set_levels(
        pd.Index([pd.NA if x == "pd.NA" else int(x) for x in columns.levels[2]],
                 dtype='Int64'), 
        level='level_2')

def where(columns):
    return columns.set_levels(
            columns.levels[2].where(
                columns.levels[2] != 'pd.NA', 
                pd.NA
            ).astype('Int64'), 
            level=2)

def pd_numeric(columns):
    return columns.set_levels(
        pd.to_numeric(columns.levels[2], errors='coerce')
        .astype('Int64'),
        level=2
    )

Test equality results:

cols = create_columns(10)

all([replace(cols).equals(x(cols)) 
     for x in [list_comprehension, where, pd_numeric]])

# True

Perfplot

import perfplot

perfplot.show(
    setup=lambda n: create_columns(n),
    kernels=[
        replace,
        list_comprehension,
        where,
        pd_numeric
    ],
    labels=['replace', 'list_comprehension', 'where', 'pd_numeric'],
    n_range=[2**k for k in range(20)],
    xlabel="N",
    equality_check=None
)

Output:

comparison results

Upvotes: 1

Federicofkt
Federicofkt

Reputation: 693

Just use set_levels:

df.columns = df.columns.set_levels(new_level_2, level='level_2')

Upvotes: 0

mozway
mozway

Reputation: 262254

You could convert the MultiIndex.to_frame then back to MultiIndex.from_frame, change the values with replace and as_type:

df.columns = pd.MultiIndex.from_frame(df.columns.to_frame()
                                      .replace({'level_2': {'pd.NA': pd.NA}})
                                      .astype({'level_2': 'Int64'}))

Output:

level_0         A                   B          
level_1       one       two       one       two
level_2         1         2         3      <NA>
0        0.144044  1.454274  0.761038  0.121675
1        0.443863  0.333674  1.494079 -0.205158
2        0.313068 -0.854096 -2.552990  0.653619

Upvotes: 2

Related Questions