Cranjis
Cranjis

Reputation: 1960

dataframe how pivot table based on substring of the column

I have a dataframe:

df = 
time id ser1 ser2 ... ser20 N0ch0 N1ch0 N2ch0 N0ch1 N1ch1 N2ch1 N0ch2 N1ch2 N2ch2 N0ch3 N1ch3 N2ch3
  1   2  4    5         3     8     7     8    5     1      4    6     2      7    9    8      6

And I want to pivot it based on the channel ('ch' substring), such that it will become a column, so new dataframe will be:

time id channel ser1 ser2 ... ser20 N0 N1 N2
  1   2   0      4    5         3   8  7  8
  1   2   1      4    5         3   5  1  4
  1   2   2      4    5         3   6  2  7
  1   2   3      4    5         3   9  8  6

What is the best way to do so?

Upvotes: 0

Views: 724

Answers (2)

Henry Ecker
Henry Ecker

Reputation: 35676

We can use set_index to save any columns which should be unmodified. Then str.split the remaining columns on 'ch' which appears to be the delimiter between new column name and channel number. Then stack and reset_index in order to go from MultiIndex columns to long form. Follow up with astype to turn the new channel column into an int from a a string (if needed).

# columns to save
idx_cols = ['time', 'id', 'ser1', 'ser2']
res = df.set_index(idx_cols)
# Separate N value from channel number
res.columns = res.columns.str.split('ch', expand=True).rename([None, 'channel'])
# Go to long form
res = res.stack().reset_index()
# Convert to number from string
res['channel'] = res['channel'].astype(int)

res:

   time  id  ser1  ser2  channel  N0  N1  N2
0     1   2     4     5        0   8   7   8
1     1   2     4     5        1   5   1   4

Alternatively wide_to_long can be used which abstracts some of the reshaping, but requires a follow up str.extract to get the channel number, and manually specifying all "stubnames":

# columns to save
idx_cols = ['time', 'id', 'ser1', 'ser2']
res = (
    pd.wide_to_long(
        df,
        i=idx_cols,
        j='channel',
        stubnames=['N0', 'N1', 'N2'],  # all stub names (add more if needed)
        suffix=r'ch\d+'  # suffix
    ).reset_index()
)
# Get only the channel numbers and convert to int
res['channel'] = res['channel'].str.extract(r'(\d+$)').astype(int)

res

   time  id  ser1  ser2  channel  N0  N1  N2
0     1   2     4     5        0   8   7   8
1     1   2     4     5        1   5   1   4

Note for either option idx_cols can be created dynamically instead of manually.

By slicing first n columns (4 for this sample code):

idx_cols = df.columns[:4]

Or by filtering the DataFrame columns based on condition (like str.startswith:

idx_cols = ['time', 'id', *df.columns[df.columns.str.startswith('ser')]]

Sample Setup:

import pandas as pd

df = pd.DataFrame({
    'time': [1], 'id': [2], 'ser1': [4], 'ser2': [5],
    'N0ch0': [8], 'N1ch0': [7], 'N2ch0': [8],
    'N0ch1': [5], 'N1ch1': [1], 'N2ch1': [4]
})

Upvotes: 0

sophocles
sophocles

Reputation: 13831

You can start by using melt with parameter id_vars set to your 'ser' like columns and 'time' + 'id'.

You can then split the 'variable' column into 2, where one of the columns will be used as an index column when using pivot_table, and the other will be a column:

# Columns to be used as index in melt & pivot
id_cols = ['time','id'] + list(df.filter(like='ser'))

# Melt and split a column
m = df.melt(id_vars = id_cols)
m[['N','channel']] = m.variable.str.split('ch', 1 ,expand=True)

# Pivot the melted dataframe
out = m.pivot_table(index = id_cols + ['channel'],  columns='N', values='value').reset_index()

prints:

   time  id channel  ser1  ser2  ser20  N0  N1  N2
0     1   2       0     4     5      3   8   7   8
1     1   2       1     4     5      3   5   1   4
2     1   2       2     4     5      3   6   2   7
3     1   2       3     4     5      3   9   8   6

Upvotes: 1

Related Questions