Sam
Sam

Reputation: 229

How to transform a large data frame

I'd like to transform the following data frame, but can't seem to get the right function to do it. Any time I use 'melt', I am prompted to convert to an array, though an array does not seem to accept column names.

In any case, any help with turning this:

A B C1 C1.A C2 C2.A
PC 11001 Core Old SE New

Into this:

A B C C.A
PC 11001 Core Old
PC 11001 SE New

Would be greatly appreciated.

Code for the first and second df's, respectively.

df1 = {'A': ['PC'], 'B': [11001],'C1':['Core'],'C1.A':['Old'],'C2':['SE'],'C2.A':['New']}

df1 = pd.DataFrame(data=df1)`


df2 = {'A': ['PC','PC'], 'B': [11001,11001],'C':['Core','SE'],'C.A':['Old','New']}

df2 = pd.DataFrame(data=df2)

Upvotes: 3

Views: 369

Answers (4)

sammywemmy
sammywemmy

Reputation: 28644

In addition to the other solutions already proferred, pivot_longer from pyjanitor is another option -> your columns have a pattern that we could take advantage of (some end with numbers, others end with A):

# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index=['A', 'B'], 
                names_to = ['C', 'C.A'], 
                names_pattern = ['.+\d$', '.+A$'])
 
    A      B     C  C.A
0  PC  11001  Core  Old
1  PC  11001    SE  New

the columns in names_to become the new column names and are mapped to the old columns that match the regular expression in names_pattern.

Upvotes: 3

SCKU
SCKU

Reputation: 833

thanks for your code typing. I will take it. :)
Here are some ideas:

  • you already hardcode the one-row result, it's a good start.
  • we can use that code with a custom function, then use groupy.apply to generate a new two-row result (dataframe) from one-row data. groupby.apply will take care of the combination with your new dataframes.
  • after that, reindex if needed.

first, we make a custom apply function for groupby:

def make_new_df_from_row(df):
    row = df.iloc[0, :] # df is each df split by groupby
    dd = pd.DataFrame({ #configure your hardcode
        'A': row.A, 'B': row.B,
        'C':[row.C1, row.C2],
        'C.A':[row['C1.A'], row['C2.A']] #column name with '.' cannot use row.C1.A
    })
    return dd

then, use df2 = df1.groupby(lambda x:x, group_keys=False).apply(make_new_df_from_row) to get result your want.

there is a screenshot from colab: enter image description here

some notes:

  • groupby accept a function which use index as input and split dataframe from the function output, here just use a lambda function, return index itself, to split dataframe by each single row (one-row dataframe).
  • group_keys=False means don't return the key (grouped index)
  • groupby.apply accept a function which input is each groupby dataframe, you can return a pd.Series or pd.Dataframe depend on your need. Here we return the two-row dataframe from your code typing (with some variable change)

Upvotes: 1

Shubham Sharma
Shubham Sharma

Reputation: 71689

Transform the columns to MultiIndex by using replace followed by split, then stack to reshape the frame to convert it into the desired format

df = df1.set_index(['A', 'B'])
df.columns = df.columns\
               .str.replace(r'^(\D+)(\d+)(.*)', r'\1\3_\2')\
               .str.split('_', expand=True)

df = df.stack().droplevel(2).reset_index()

    A      B     C  C.A
0  PC  11001  Core  Old
1  PC  11001    SE  New

Upvotes: 7

Henry Ecker
Henry Ecker

Reputation: 35626

With a little column reformatting, pd.wide_to_long can be used:

df.columns = df.columns.str.replace(r'C(\d+).A', r'C.A\1', regex=True)
df = (
    pd.wide_to_long(df, stubnames=['C', 'C.A'], i=['A', 'B'], j='drop')
        .droplevel(axis=0, level='drop')
        .reset_index()
)

df:

    A      B     C  C.A
0  PC  11001  Core  Old
1  PC  11001    SE  New

Change the columns in df from:

    A      B    C1 C1.A  C2 C2.A
0  PC  11001  Core  Old  SE  New

To

df.columns = df.columns.str.replace(r'C(\d+).A', r'C.A\1', regex=True)
    A      B    C1 C.A1  C2 C.A2
0  PC  11001  Core  Old  SE  New

Then wide_to_long on the stubnames C and C.A


An alternative without affecting df:

new_df = (
    pd.wide_to_long(
        df.rename(columns=dict(
            zip(df.columns,
                df.columns.str.replace(r'C(\d+).A', r'C.A\1', regex=True)))
        ),
        stubnames=['C', 'C.A'],
        i=['A', 'B'],
        j='drop')
        .droplevel(axis=0, level='drop')
        .reset_index()
)

new_df:

    A      B     C  C.A
0  PC  11001  Core  Old
1  PC  11001    SE  New

df:

    A      B    C1 C1.A  C2 C2.A
0  PC  11001  Core  Old  SE  New

Upvotes: 4

Related Questions