TiTo
TiTo

Reputation: 865

How to extract period and variable name from dataframe column strings for multiindex panel data preparation

I'm new to Python and could not find the answer I'm looking for anywhere.

I have a DataFrame that has the following structure:

df = pd.DataFrame(index=list('abc'), data={'A1': range(3), 'A2': range(3),'B1': range(3), 'B2': range(3), 'C1': range(3), 'C2': range(3)})
df
Out[1]:
   A1 A2 B1 B2 C1 C2
a  0  0  0  0  0  0
b  1  1  1  1  1  1
c  2  2  2  2  2  2

Where the numbers are periods and he letters are variables. I'd like to transform the columns in a way, that I split the periods and variables into a multiindex. The desired output would look like that

   A     B     C   
   1  2  1  2  1  2
a  0  0  0  0  0  0
b  1  1  1  1  1  1
c  2  2  2  2  2  2

I've tried the following:

periods = list(range(1, 3))
df.columns = df.columns.str.replace('\d+', '')
df.columns = pd.MultiIndex.from_product([df.columns, periods])

That seams to be multiplying the columns and raising an ValueError: Length mismatch

in my dataframe I have 72 periods and 12 variables.

Thanks in advance for your help!

Edit: I realized that I haven't been precise enough. I have several columns names something like Impressions1, Impressions2...Impressions72 and hhi1, hhi2...hhi72. So df.columns.str[0],df.columns.str[1] does not work for me, as all column names have a different length. I think the solution might contain regex but I can't figure out how to do it. Any ideas?

Upvotes: 2

Views: 277

Answers (3)

TiTo
TiTo

Reputation: 865

Here is what actually solved my issue:

df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(r'([a-zA-Z]+)([0-9]+)'), names=[None, None])

Thanks @Scott Boston for your inspiration to the solution!

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153460

You can also use, .str.extract and from_frame:

df.columns = pd.MultiIndex.from_frame(df.columns.str.extract('(.)(.)'), names=[None, None])

Output:

   A     B     C   
   1  2  1  2  1  2
a  0  0  0  0  0  0
b  1  1  1  1  1  1
c  2  2  2  2  2  2

Upvotes: 2

ansev
ansev

Reputation: 30920

Use pd.MultiIndex.from_tuples:

df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns.str[0],df.columns.str[1])))
print(df)
   A     B     C   
   1  2  1  2  1  2
a  0  0  0  0  0  0
b  1  1  1  1  1  1
c  2  2  2  2  2  2

Alternative:

pd.MultiIndex.from_tuples([tuple(name) for name in df.columns])

or

pd.MultiIndex.from_tuples(map(tuple, df.columns))

Upvotes: 3

Related Questions