Reputation: 972
I have a large dataframe a that I would like to split or explode into dataframe b (the real dataframe a contains 90 columns).
I looked for solutions to similar problems but could not find any, since this is not about the values in the cells but about the column names.
Any pointer to a solution, or to an existing pandas function, would be appreciated.
Thank you in advance.
from pandas import DataFrame

# current df
a = DataFrame([{'ID': 'ID_1', 'A-1': 'a1', 'B-1': 'b1', 'C-1': 'c1',
                'A-2': 'a2', 'B-2': 'b2', 'C-2': 'c2'}])

# desired df
b = DataFrame([{'ID': 'ID_1', 'A': 'a1', 'B': 'b1', 'C': 'c1'},
               {'ID': 'ID_1', 'A': 'a2', 'B': 'b2', 'C': 'c2'}])
One idea I have is to split this dataframe into two dataframes (dataframe 1 containing columns A-1 to C-1, dataframe 2 containing columns A-2 to C-2), rename the columns to A/B/C, and then concatenate both. But I am not sure about efficiency, since I have 90 columns and the number will grow over time.
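A minimal sketch of that split/rename/concatenate idea, generalized so the 90 columns do not have to be listed by hand. It assumes every non-ID column follows the '<stub>-<suffix>' pattern of the example; the helper names are only illustrative:

import pandas as pd

# Sketch only: assumes every column except 'ID' is named '<stub>-<suffix>'.
suffixes = sorted({c.split('-')[1] for c in a.columns if c != 'ID'})

pieces = []
for suf in suffixes:
    cols = [c for c in a.columns if c != 'ID' and c.split('-')[1] == suf]
    piece = a[['ID'] + cols].rename(columns=lambda c: c.split('-')[0])
    pieces.append(piece)

b = pd.concat(pieces, ignore_index=True)
#      ID   A   B   C
# 0  ID_1  a1  b1  c1
# 1  ID_1  a2  b2  c2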
Upvotes: 2
Views: 168
Reputation: 28729
One option is with the pivot_longer function from pyjanitor, which abstracts the reshaping process and is also efficient:
# pip install pyjanitor
import janitor
import pandas as pd
a.pivot_longer(index="ID", names_to=".value", names_pattern="(.).+")
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
The .value tells the function which part of the column names to retain as headers. It takes its cue from names_pattern, which should be a regular expression with groups; the parts captured by the groups are what stay as headers. In this case we are interested in the first letter of each column name, which is matched by (.).
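Since the real frame has 90 columns, the stubs may well be longer than one character. A hedged variation, assuming every reshaped column is named '<stub>-<number>', captures everything before the '-' instead of only the first letter:

# Assumption: the reshaped columns all follow the '<stub>-<number>' pattern.
a.pivot_longer(index="ID", names_to=".value", names_pattern=r"(.+)-\d+")

On the example frame this gives the same result as above.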
Another option with pivot_longer is to use the names_sep parameter:
(a.pivot_longer(index="ID", names_to=(".value", "num"), names_sep="-")
.drop(columns="num")
)
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
Again, only the parts of the column names associated with .value are retained as headers.
Upvotes: 1
Reputation: 35686
pd.wide_to_long works well here assuming a small number of known stubnames:
b = (
pd.wide_to_long(a, stubnames=['A', 'B', 'C'], sep='-', i='ID', j='to_drop')
.droplevel(level='to_drop')
.reset_index()
)
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
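If the stubnames are not known up front (90 columns, growing over time), they can be derived from the column names first. A small sketch, assuming every non-ID column follows the '<stub>-<number>' pattern:

# Assumption: every column other than 'ID' is named '<stub>-<number>'.
stubnames = sorted({col.split('-')[0] for col in a.columns if col != 'ID'})
b = (
    pd.wide_to_long(a, stubnames=stubnames, sep='-', i='ID', j='to_drop')
    .droplevel(level='to_drop')
    .reset_index()
)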
Alternatively, set_index, split the columns on '-' with str.split, and stack:
b = a.set_index('ID')
b.columns = b.columns.str.split('-', expand=True)
b = b.stack().droplevel(-1).reset_index()
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
Upvotes: 2
Reputation: 2541
This approach generates some intermediate columns that are removed later on.
First, bring the labels (A-1, ...) down from the header into a column:
df = pd.melt(a, id_vars=['ID'], var_name='label')
Then split the label into its character and number parts:
df[['char', 'num']] = df['label'].str.split('-', expand=True)
Finally, drop the label, set_index before unstack, and take care of the final table format:
(df.drop('label', axis=1)
   .set_index(['ID', 'num', 'char'])
   .unstack()
   .droplevel(0, axis=1)
   .reset_index()
   .drop('num', axis=1))
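Run end to end on the example frame a from the question, the steps above reproduce the desired b (a quick check; label, char and num are just the intermediate names used in this answer):

import pandas as pd

df = pd.melt(a, id_vars=['ID'], var_name='label')
df[['char', 'num']] = df['label'].str.split('-', expand=True)
b = (df.drop('label', axis=1)
       .set_index(['ID', 'num', 'char'])
       .unstack()
       .droplevel(0, axis=1)
       .reset_index()
       .drop('num', axis=1))
# -> columns ID, A, B, C; one row for the '-1' group and one for the '-2' group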
Upvotes: 2
Reputation: 304
import math
import pandas as pd

# Toy frame: 8 columns, to be split in half and stacked.
df = pd.DataFrame(data={k: [i * k for i in range(1, 5)] for k in range(1, 9)})
assert df.shape[1] % 2 == 0  # needs an even number of columns

# Split the columns into a left half and a right half.
half = math.floor(df.shape[1] / 2)
df_1 = df.iloc[:, :half]
df_2 = df.iloc[:, half:].copy()

# Give the right half the same column names, then stack the halves.
df_2.columns = df_1.columns
df_sum = pd.concat((df_1, df_2), axis=0)
print(df_sum)  # use display(df_sum) in a Jupyter notebook
Upvotes: 0