sel

Reputation: 972

Explode data frame columns into multiple rows

I have a large dataframe a that I would like to split or explode to become dataframe b (the real dataframe a contains 90 columns).

I looked for solutions to similar problems but did not find any, since my case depends on the column names rather than on the values in the cells.

Any pointer to the solution or to using an existing function in the pandas library would be appreciated.

Thank you in advance.

from pandas import DataFrame
import numpy as np
# current df
a = DataFrame([{'ID': 'ID_1', 'A-1': 'a1', 'B-1':'b1','C-1':'c1', 'A-2': 'a2', 'B-2':'b2','C-2':'c2'}])

# desired df
b = DataFrame([{'ID': 'ID_1', 'A': 'a1', 'B':'b1', 'C':'c1'},
               {'ID': 'ID_1','A': 'a2', 'B':'b2','C':'c2'}])


One idea I have is to split this dataframe into two dataframes (Dataframe 1 would contain columns A-1 to C-1 and Dataframe 2 would contain columns A-2 to C-2), rename the columns to A/B/C, and then concatenate both. But I am not sure how efficient this would be, since I have 90 columns and the number will grow over time.
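That split, rename, and concatenate idea could be sketched roughly as follows. It assumes every column other than ID is named `<stub>-<number>`; the variable names (`value_cols`, `suffixes`, `pieces`) are illustrative only and not part of any library.

```python
import pandas as pd

a = pd.DataFrame([{'ID': 'ID_1', 'A-1': 'a1', 'B-1': 'b1', 'C-1': 'c1',
                   'A-2': 'a2', 'B-2': 'b2', 'C-2': 'c2'}])

# Assumption: every non-ID column looks like "<stub>-<number>"
value_cols = [c for c in a.columns if c != 'ID']
suffixes = sorted({c.split('-')[1] for c in value_cols}, key=int)

pieces = []
for suffix in suffixes:
    # Columns that belong to this numbered group, e.g. A-1, B-1, C-1
    cols = [c for c in value_cols if c.split('-')[1] == suffix]
    # Strip the "-<number>" part so every piece shares the headers A/B/C
    piece = a[['ID'] + cols].rename(columns=lambda c: c.split('-')[0])
    pieces.append(piece)

b = pd.concat(pieces, ignore_index=True)
print(b)
```

Because the groups are discovered from the column names, new numbered groups would be picked up without editing the code, provided the naming pattern holds.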

Upvotes: 2

Views: 168

Answers (4)

sammywemmy

Reputation: 28729

One option is with the pivot_longer function from pyjanitor, which abstracts the reshaping process and is also efficient:

# pip install pyjanitor
import janitor
import pandas as pd

a.pivot_longer(index="ID", names_to=".value", names_pattern="(.).+")
 
     ID   A   B   C
0  ID_1  a1  b1  c1
1  ID_1  a2  b2  c2

The .value placeholder tells the function which part of each column name to keep as a header. It takes its cue from names_pattern, which should be a regular expression with capture groups; the text matched by the groups is what stays as headers. In this case, we are interested in the first letter of each column name, which is captured by (.).
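To see what that pattern captures, here is a small illustration using only Python's standard `re` module. It shows the regular expression itself, not pyjanitor's internals, and the second pattern is just a suggestion for stubs longer than one character.

```python
import re

columns = ["A-1", "B-1", "C-1", "A-2", "B-2", "C-2"]

# "(.).+" captures the first character and requires at least one more after it
headers = [re.match(r"(.).+", col).group(1) for col in columns]
print(headers)

# Assumption: for multi-character stubs such as "AB-1", a pattern that
# captures everything before the "-<digits>" suffix may be more suitable
longer = re.match(r"(.+)-\d+", "AB-1").group(1)
print(longer)
```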

Another option, with pivot_longer, is to use the names_sep parameter:

(a.pivot_longer(index="ID", names_to=(".value", "num"), names_sep="-")
.drop(columns="num")
)

     ID   A   B   C
0  ID_1  a1  b1  c1
1  ID_1  a2  b2  c2

Again, only values in the columns associated with .value remain as headers.

Upvotes: 1

Henry Ecker

Reputation: 35686

pd.wide_to_long works well here assuming a small number of known stubnames:

b = (
    pd.wide_to_long(a, stubnames=['A', 'B', 'C'], sep='-', i='ID', j='to_drop')
        .droplevel(level='to_drop')
        .reset_index()
)

     ID   A   B   C
0  ID_1  a1  b1  c1
1  ID_1  a2  b2  c2
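With 90 columns, typing the stubnames by hand may be impractical. One possible sketch derives them from the column names instead, assuming every non-ID column follows the `<stub>-<number>` pattern; the names `stubs` and `num` are illustrative.

```python
import pandas as pd

a = pd.DataFrame([{'ID': 'ID_1', 'A-1': 'a1', 'B-1': 'b1', 'C-1': 'c1',
                   'A-2': 'a2', 'B-2': 'b2', 'C-2': 'c2'}])

# Assumption: the text before the "-" in each non-ID column is the stubname
stubs = sorted({c.split('-')[0] for c in a.columns if c != 'ID'})

b = (
    pd.wide_to_long(a, stubnames=stubs, sep='-', i='ID', j='num')
    .reset_index()          # bring ID and num back as ordinary columns
    .drop(columns='num')    # the numeric suffix is not needed in the result
)
print(b[['ID'] + stubs])
```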

Alternatively set_index, split the columns on '-' with str.split and stack:

b = a.set_index('ID')
b.columns = b.columns.str.split('-', expand=True)
b = b.stack().droplevel(-1).reset_index()

     ID   A   B   C
0  ID_1  a1  b1  c1
1  ID_1  a2  b2  c2

Upvotes: 2

Raymond Kwok

Reputation: 2541

This approach will generate some intermediate columns which will be removed later on.

First bring down those labels (A-1,...) from the header into a column

df = pd.melt(a, id_vars=['ID'], var_name='label')

Then split the label into character and number

df[['char', 'num']] = df['label'].str.split('-', expand=True)

Finally, drop the label column, set the index, unstack, and tidy up the resulting table:

df.drop('label', axis=1)\
    .set_index(['ID', 'num', 'char'])\
    .unstack()\
    .droplevel(0, axis=1)\
    .reset_index()\
    .drop('num', axis=1)
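For reference, the three steps can be run end to end on the sample frame from the question. This is a slight variation rather than the exact chain above: it selects the value column before unstacking, so no droplevel is needed, and it clears the leftover column-axis name at the end.

```python
import pandas as pd

a = pd.DataFrame([{'ID': 'ID_1', 'A-1': 'a1', 'B-1': 'b1', 'C-1': 'c1',
                   'A-2': 'a2', 'B-2': 'b2', 'C-2': 'c2'}])

# Step 1: move the header labels (A-1, ...) into a 'label' column
long_df = pd.melt(a, id_vars=['ID'], var_name='label')

# Step 2: split each label into its letter and its number
long_df[['char', 'num']] = long_df['label'].str.split('-', expand=True)

# Step 3: pivot the letters back into columns, one row per (ID, num)
result = (
    long_df.drop(columns='label')
    .set_index(['ID', 'num', 'char'])['value']
    .unstack('char')
    .reset_index()
    .drop(columns='num')
    .rename_axis(columns=None)  # remove the leftover 'char' axis name
)
print(result)
```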

Upvotes: 2

Timo Junolainen

Reputation: 304

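import pandas as pd

# Demo frame with 8 columns; the second half is assumed to mirror the first half
df = pd.DataFrame(data={k: [i * k for i in range(1, 5)] for k in range(1, 9)})
assert df.shape[1] % 2 == 0  # an even number of columns is required
half = df.shape[1] // 2
df_1 = df.iloc[:, :half]
df_2 = df.iloc[:, half:]
# Give the second half the first half's column names, then stack the halves
df_2.columns = df_1.columns
df_sum = pd.concat((df_1, df_2), axis=0)
print(df_sum)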

Like this?

Upvotes: 0
