Lynn
Lynn

Reputation: 4398

De-aggregate values to unique rows using Python

I have a dataset where I would like to de aggregate the values into their own unique rows as well as perform a pivot.

Data

ID  type    Q1 24   Q2 24   
AA  hello   1       1   
AA  hi      2       1   
AA  bye     1       0   
AA  ok      0       1   
BB  hello   1       0   
BB  hi      1       0   
BB  bye     0       1   
BB  ok      1       1   

Desired

ID  date    type
AA  Q1 24   hello
AA  Q1 24   hi
AA  Q1 24   hi
AA  Q1 24   bye
AA  Q2 24   hello
AA  Q2 24   hi
AA  Q2 24   ok
BB  Q1 24   hello
BB  Q1 24   hi
BB  Q1 24   ok
BB  Q1 24   bye
BB  Q2 24   ok

Doing

import pandas as pd
import janitor

df.pivot_longer(
    index = 'ID', 
    names_to = ('date', '.value'), 
    names_pattern = r"(Q\d+)_?(.+)",
    sort_by_appearance = True)

I am currently researching, any suggestion is appreciated.

Upvotes: 2

Views: 51

Answers (2)

Naveed
Naveed

Reputation: 11650

here is one way to do it

#melt the dataframe on ID and type

df2=df.melt(id_vars=['ID','type'], var_name='date')

# the value column has the repeated values, which is used to duplicate the 
# indexes of the  DF2, then merged to repeat the rows as many times as the 
# value is repeated. finally, its sorting, reindexing and dropping unneeded
 #columns
(df2.index.repeat(df2['value']).to_frame()
   .merge(df2, left_index=True, right_index=True)
 .sort_values(['ID','date']) 
) .reset_index().drop(columns=['index','value',0]) 

for instance, for ID:AA, Q1-24, the value is 2, the df2.index.repeat this index twice, then merge with the df2.

    ID  type    date
0   AA  hello   Q1 24
1   AA  hi      Q1 24
2   AA  hi      Q1 24
3   AA  bye     Q1 24
4   AA  hello   Q2 24
5   AA  hi      Q2 24
6   AA  ok      Q2 24
7   BB  hello   Q1 24
8   BB  hi      Q1 24
9   BB  ok      Q1 24
10  BB  bye     Q2 24
11  BB  ok      Q2 24

Upvotes: 1

Chris
Chris

Reputation: 29742

IIUC, one way using pandas.Index.repeat with pandas.concat:

df = df.set_index(["ID", "type"])
new_df = pd.concat([pd.Series(c, index=df.index.repeat(df[c]))
                    for c in df]).reset_index(name="date")

# Optionally, sort values
new_df = new_df.sort_values(["ID", "date"], ignore_index=True) 

Output:

    ID   type   date
0   AA  hello  Q1 24
1   AA     hi  Q1 24
2   AA     hi  Q1 24
3   AA    bye  Q1 24
4   AA  hello  Q2 24
5   AA     hi  Q2 24
6   AA     ok  Q2 24
7   BB  hello  Q1 24
8   BB     hi  Q1 24
9   BB     ok  Q1 24
10  BB    bye  Q2 24
11  BB     ok  Q2 24

Upvotes: 2

Related Questions