Reputation: 67

Reshaping data with dates as column values

I am trying to reshape data using pandas and have been having a hard time getting it into the right format. Roughly, the data look like this*:

import numpy as np
import pandas as pd

df = pd.DataFrame({'PRODUCT':['1','2'],
          'DESIGN_START':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17')],
          'DESIGN_COMPLETE':[pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04')],
          'PRODUCTION_START':[pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15')],
          'PRODUCTION_COMPLETE':[np.nan,pd.Timestamp('2020-04-28')]})
print(df)

  PRODUCT DESIGN_START DESIGN_COMPLETE PRODUCTION_START PRODUCTION_COMPLETE
0       1   2020-01-05      2020-01-22       2020-02-07                 NaT
1       2   2020-01-17      2020-03-04       2020-03-15          2020-04-28

I would like to reshape the data so that it looks like this:

reshaped_df = pd.DataFrame({'DATE':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17'),
                          pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04'),
                          pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15'),
                          np.nan,pd.Timestamp('2020-04-28')],
                  'STAGE':['design','design','design','design','production','production','production','production'],
                  'STATUS':['started','started','completed','completed','started','started','completed','completed']})

print(reshaped_df)

        DATE       STAGE     STATUS
0 2020-01-05      design    started
1 2020-01-17      design    started
2 2020-01-22      design  completed
3 2020-03-04      design  completed
4 2020-02-07  production    started
5 2020-03-15  production    started
6        NaT  production  completed
7 2020-04-28  production  completed

How can I go about doing this? Is there a better format to reshape it to?

Ultimately I'd like to do some group summaries on the data, such as the number of times each step occurred, e.g.

reshaped_df.groupby(['STAGE','STATUS'])['DATE'].count()

STAGE       STATUS   
design      completed    2
            started      2
production  completed    1
            started      2
Name: DATE, dtype: int64

Thank you

*The data actually contain many date start/stop columns for different stages of the manufacturing pipeline.

Upvotes: 2

Answers (4)

sammywemmy

Reputation: 28729

Convert the column names to lowercase and split them on '_'. Passing expand=True turns the result into a MultiIndex:

df.columns = df.columns.str.lower().str.split('_',expand=True)
df.columns = df.columns.set_names(['stage','status'])

print(df)

product              design             production
NaN       start     complete    start      complete
0   1   2020-01-05  2020-01-22  2020-02-07  NaT
1   2   2020-01-17  2020-03-04  2020-03-15  2020-04-28

The next step combines stack, sort_values, droplevel, reset_index, and reindex:

res = (df
       .stack([0,1])
       .sort_values()
       .droplevel(0)
       .reset_index(name='Date')
       .reindex(['Date','stage','status'],axis=1)
      )

res


      Date      stage       status
0   2020-01-05  design      start
1   2020-01-17  design      start
2   2020-01-22  design      complete
3   2020-02-07  production  start
4   2020-03-04  design      complete
5   2020-03-15  production  start
6   2020-04-28  production  complete

If you only need the groupings and an aggregation, you can skip the long path and group right after the stack:

df.stack([0,1]).groupby(['stage','status']).count()


  stage       status  
design      complete    2
            start       2
production  complete    1
            start       2
Name: Date, dtype: int64
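
The count for production/complete is 1 rather than 2 because missing dates are not counted. If you also want to see how many dates are missing per step, here is a minimal sketch that keeps the NaT rows and compares size with count. It uses melt instead of stack to build the long frame, and the names long_df, summary and missing are only illustrative:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'PRODUCT': ['1', '2'],
    'DESIGN_START': [pd.Timestamp('2020-01-05'), pd.Timestamp('2020-01-17')],
    'DESIGN_COMPLETE': [pd.Timestamp('2020-01-22'), pd.Timestamp('2020-03-04')],
    'PRODUCTION_START': [pd.Timestamp('2020-02-07'), pd.Timestamp('2020-03-15')],
    'PRODUCTION_COMPLETE': [pd.NaT, pd.Timestamp('2020-04-28')],
})

# melt keeps rows whose value is NaT
long_df = df.melt(id_vars='PRODUCT', value_name='DATE')
long_df[['STAGE', 'STATUS']] = long_df['variable'].str.split('_', n=1, expand=True)

# size counts every row, count only the non-missing dates
summary = long_df.groupby(['STAGE', 'STATUS'])['DATE'].agg(['size', 'count'])
summary['missing'] = summary['size'] - summary['count']
print(summary)
```

Here size is the number of products per step and count is the number that actually have a date recorded.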

UPDATE 2021/06/01:

You can use the pivot_longer function from pyjanitor to abstract the reshaping. At the moment you have to install the latest development version from GitHub:

# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor

df.rename(columns=str.lower).pivot_longer(
    index="product",
    names_sep="_",
    names_to=("stage", "status"),
    values_to="date",
)

  product   stage      status       date
0   1       design      start       2020-01-05
1   2       design      start       2020-01-17
2   1       design      complete    2020-01-22
3   2       design      complete    2020-03-04
4   1       production  start       2020-02-07
5   2       production  start       2020-03-15
6   1       production  complete    NaT
7   2       production  complete    2020-04-28

Upvotes: 1

BENY

Reputation: 323366

We can use pd.wide_to_long, then stack and reorder the result:

s = (pd.wide_to_long(df, ['DESIGN','PRODUCTION'], i='PRODUCT', j='STATUS', suffix=r'\w+', sep='_')
       .stack(dropna=False).reset_index(level=[1,2]).sort_values('level_2')
       .reset_index(drop=True).rename(columns={'level_2':'STAGE',0:'DATE'}))
print(s)

     STATUS       STAGE       DATE
0     START      DESIGN 2020-01-05
1     START      DESIGN 2020-01-17
2  COMPLETE      DESIGN 2020-01-22
3  COMPLETE      DESIGN 2020-03-04
4     START  PRODUCTION 2020-02-07
5     START  PRODUCTION 2020-03-15
6  COMPLETE  PRODUCTION        NaT
7  COMPLETE  PRODUCTION 2020-04-28
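
A variant of the same idea, as a rough sketch: after wide_to_long, the stack step can be swapped for melt, which keeps the NaT row without needing a dropna argument. This is a substitution rather than the code above, and long_df and counts are just illustrative names:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'PRODUCT': ['1', '2'],
    'DESIGN_START': [pd.Timestamp('2020-01-05'), pd.Timestamp('2020-01-17')],
    'DESIGN_COMPLETE': [pd.Timestamp('2020-01-22'), pd.Timestamp('2020-03-04')],
    'PRODUCTION_START': [pd.Timestamp('2020-02-07'), pd.Timestamp('2020-03-15')],
    'PRODUCTION_COMPLETE': [pd.NaT, pd.Timestamp('2020-04-28')],
})

# One row per (PRODUCT, STATUS), with one column per stage
half_long = pd.wide_to_long(
    df, ['DESIGN', 'PRODUCTION'],
    i='PRODUCT', j='STATUS', sep='_', suffix=r'\w+',
).reset_index()

# Unpivot the stage columns as well; rows holding NaT are kept
long_df = half_long.melt(
    id_vars=['PRODUCT', 'STATUS'], var_name='STAGE', value_name='DATE'
)

counts = long_df.groupby(['STAGE', 'STATUS'])['DATE'].count()
print(long_df)
print(counts)
```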

Upvotes: 1

bherbruck

Reputation: 2226

MELT IT!!!

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'PRODUCT':['1','2'],
    'DESIGN_START':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17')],
    'DESIGN_COMPLETE':[pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04')],
    'PRODUCTION_START':[pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15')],
    'PRODUCTION_COMPLETE':[np.nan,pd.Timestamp('2020-04-28')]
})

df = df.melt(id_vars=['PRODUCT'])
df_split = df['variable'].str.split('_', n=1, expand=True)
df['STAGE'] = df_split[0]
df['STATUS'] = df_split[1]
df.drop(columns=['variable'], inplace=True)
df = df.rename(columns={'value': 'DATE'})

print(df)

Output:

  PRODUCT       DATE       STAGE    STATUS
0       1 2020-01-05      DESIGN     START
1       2 2020-01-17      DESIGN     START
2       1 2020-01-22      DESIGN  COMPLETE
3       2 2020-03-04      DESIGN  COMPLETE
4       1 2020-02-07  PRODUCTION     START
5       2 2020-03-15  PRODUCTION     START
6       1        NaT  PRODUCTION  COMPLETE
7       2 2020-04-28  PRODUCTION  COMPLETE

MWAHAHAHAHAHAHA!!! FEEL THE POWER OF THE MELT!!!

Melt is basically unpivot
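
To make the "unpivot" point concrete, here is a small sketch (on a cut-down version of the question's data, with illustrative names wide, long_df and restored) showing that pivot takes the melted frame back to the original wide layout:

```python
import pandas as pd

# Cut-down version of the question's data (no missing dates)
wide = pd.DataFrame({
    'PRODUCT': ['1', '2'],
    'DESIGN_START': pd.to_datetime(['2020-01-05', '2020-01-17']),
    'DESIGN_COMPLETE': pd.to_datetime(['2020-01-22', '2020-03-04']),
})

# melt: column headers become values in the 'variable' column
long_df = wide.melt(id_vars='PRODUCT', var_name='variable', value_name='DATE')

# pivot: values in 'variable' become column headers again
restored = long_df.pivot(index='PRODUCT', columns='variable', values='DATE')
restored = restored.reset_index().rename_axis(columns=None)
restored = restored[wide.columns]  # pivot sorts the columns, so restore the order

print(long_df)
print(restored.equals(wide))
```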

Upvotes: 2

Code Different

Reputation: 93191

Drop PRODUCT, modify the columns into a MultiIndex and stack them:

new_cols = pd.MultiIndex.from_product([['design', 'production'], ['started', 'completed']], names=['STAGE', 'STATUS'])
df.drop(columns='PRODUCT') \
    .set_axis(new_cols, axis=1) \
    .stack([0,1]) \
    .groupby(['STAGE', 'STATUS']) \
    .count()
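
The MultiIndex.from_product call above relies on the columns being in exactly that order. Since the question mentions many more start/stop columns, here is a hedged sketch that derives the MultiIndex from the column names instead. It assumes every date column is named STAGE_STATUS with START or COMPLETE as the suffix; status_labels is a hypothetical mapping, and DataFrame.count is used in place of stack plus groupby:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'PRODUCT': ['1', '2'],
    'DESIGN_START': [pd.Timestamp('2020-01-05'), pd.Timestamp('2020-01-17')],
    'DESIGN_COMPLETE': [pd.Timestamp('2020-01-22'), pd.Timestamp('2020-03-04')],
    'PRODUCTION_START': [pd.Timestamp('2020-02-07'), pd.Timestamp('2020-03-15')],
    'PRODUCTION_COMPLETE': [pd.NaT, pd.Timestamp('2020-04-28')],
})

dates = df.drop(columns='PRODUCT')

# Hypothetical mapping from column suffix to the label wanted in the output
status_labels = {'START': 'started', 'COMPLETE': 'completed'}

# Build (stage, status) pairs from the names, whatever order the columns are in
pairs = [name.split('_', 1) for name in dates.columns]
dates.columns = pd.MultiIndex.from_tuples(
    [(stage.lower(), status_labels[status]) for stage, status in pairs],
    names=['STAGE', 'STATUS'],
)

# Non-missing dates per column, indexed by (STAGE, STATUS)
counts = dates.count()
print(counts)
```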

Upvotes: 1
