Reputation: 4899
Consider the following multi-level column index dataframe:
import numpy as np
import pandas as pd
arrays = [
["A", "A", "B", "B"],
["one", "two", "one", "two"],
["1", "2", "1", "pd.NA"],
]
idx = pd.MultiIndex.from_arrays(arrays, names=["level_0", "level_1", "level_2"])
data = np.random.randn(3, 4)
df = pd.DataFrame(data, columns=idx)
print(df)
level_0 A B
level_1 one two one two
level_2 1 2 1 pd.NA
0 -1.249285 0.314225 0.011139 0.675274
1 -0.654808 -0.492350 0.596338 -0.087334
2 0.113570 0.566687 -0.361334 0.085368
level_2
holds values of type object
(str
really).
df.columns.get_level_values(2)
Index(['1', '2', '1', 'pd.NA'], dtype='object', name='level_2')
I need to parse it to the correct data type and change this particular column level.
new_level_2 = [
pd.NA if x == "pd.NA" else int(x) for x in df.columns.get_level_values(2)
]
I am looking for a pythonic way to replace the old level_2
with new_level_2
.
Upvotes: 4
Views: 108
Reputation: 8778
One option is to use rename()
df.rename(lambda x: np.NaN if x=='pd.NA' else int(x), level=2, axis=1)
Output:
A B
one two one two
1 2 1 NaN
0 -0.270906 -1.650086 -0.919673 0.878767
1 0.557711 0.797267 -0.433733 0.516062
2 3.799616 0.333024 -0.159180 0.003272
Upvotes: 0
Reputation: 14414
You can use MultiIndex.set_levels
, MultiIndex.levels
and Index.where
. Chain Index.astype
for proper dtype conversion:
df.columns = (
df.columns.set_levels(
df.columns.levels[2].where(
df.columns.levels[2] != 'pd.NA',
pd.NA
).astype('Int64'),
level=2
)
)
Or using pd.to_numeric
:
df.columns = (
df.columns.set_levels(
pd.to_numeric(df.columns.levels[2], errors='coerce')
.astype('Int64'),
level=2
)
)
Output:
level_0 A B
level_1 one two one two
level_2 1 2 1 <NA>
0 1.764052 0.400157 0.978738 2.240893
1 1.867558 -0.977278 0.950088 -0.151357
2 -0.103219 0.410599 0.144044 1.454274
Performance comparison
If performance is a concern, a comparison seems to suggest:
Index.where
for a big one.Setup data
import pandas as pd
import numpy as np
def create_columns(N):
np.random.seed(0)
array = np.arange(N).astype(str)
num_nans = int(0.2 * N) # 20% 'pd.NA'
nan_indices = np.random.choice(N, num_nans, replace=False)
array[nan_indices] = 'pd.NA'
columns = pd.MultiIndex.from_product([
["A", "B"],
["one", "two"],
array
], names=["level_0", "level_1", "level_2"])
return columns
Functions
replace
: answer by @mozwaylist_comprehension
: corrected version of answer by @federicofktwhere
: above, first optionpd_numeric
: above, second optiondef replace(columns):
return columns.from_frame(columns.to_frame()
.replace({'level_2': {'pd.NA': pd.NA}})
.astype({'level_2': 'Int64'}))
def list_comprehension(columns):
return columns.set_levels(
pd.Index([pd.NA if x == "pd.NA" else int(x) for x in columns.levels[2]],
dtype='Int64'),
level='level_2')
def where(columns):
return columns.set_levels(
columns.levels[2].where(
columns.levels[2] != 'pd.NA',
pd.NA
).astype('Int64'),
level=2)
def pd_numeric(columns):
return columns.set_levels(
pd.to_numeric(columns.levels[2], errors='coerce')
.astype('Int64'),
level=2
)
Test equality results:
cols = create_columns(10)
all([replace(cols).equals(x(cols))
for x in [list_comprehension, where, pd_numeric]])
# True
Perfplot
import perfplot
perfplot.show(
setup=lambda n: create_columns(n),
kernels=[
replace,
list_comprehension,
where,
pd_numeric
],
labels=['replace', 'list_comprehension', 'where', 'pd_numeric'],
n_range=[2**k for k in range(20)],
xlabel="N",
equality_check=None
)
Output:
Upvotes: 1
Reputation: 693
Just use set_levels
:
df.columns = df.columns.set_levels(new_level_2, level='level_2')
Upvotes: 0
Reputation: 262254
You could convert the MultiIndex.to_frame
then back to MultiIndex.from_frame
, change the values with replace
and as_type
:
df.columns = pd.MultiIndex.from_frame(df.columns.to_frame()
.replace({'level_2': {'pd.NA': pd.NA}})
.astype({'level_2': 'Int64'}))
Output:
level_0 A B
level_1 one two one two
level_2 1 2 3 <NA>
0 0.144044 1.454274 0.761038 0.121675
1 0.443863 0.333674 1.494079 -0.205158
2 0.313068 -0.854096 -2.552990 0.653619
Upvotes: 2