Ank

Reputation: 1704

Pandas - Reshape a dataframe columns based on repeated key-value pair columns with duplicate names

I am receiving a dataframe from an API that has data for some entities in the initial columns, while the later columns contain metadata for those entities. The data is proprietary so I cannot show it exactly, but the structure of the dataframe is exactly as below:

idx  name types  value value_date        desc group owner key_name key_value   key_name   key_value   key_name  key_value
  1 name1 type1     45 2021-05-30 name1-type1    G1    O1       k1        A1         k2          A2         k3         A3
  2 name1 type2     23 2021-05-30 name1-type2    G1    O1       k3        B3        NaN         NaN         k2         B2
  3 name2 type1     41 2021-05-30 name2-type1    G2    O2      NaN       NaN         k1          C1        NaN        NaN
  4 name3 type1     39 2021-05-30 name3-type1    G1    O1       k1        D1         k3          D3        NaN        NaN
  5 name4 type1     40 2021-05-30 name4-type1    G3    O3       k1        E1         k3          E3         k2         E2
  6 name4 type2     21 2021-05-30 name4-type2    G3    O3       k3        F3         k2          F2         k1         F1
  7 name4 type3     11 2021-05-30 name4-type3    G3    O3      NaN       NaN        NaN         NaN        NaN        NaN
  8 name5 type1     44 2021-05-30 name5-type1    G1    O1      NaN       NaN         k1          H1        NaN        NaN
  9 name6 type1     49 2021-05-30 name6-type1    G2    O2      NaN       NaN         k2          I2        NaN        NaN
 10 name6 type2     26 2021-05-30 name6-type2    G2    O2       k1        J1        NaN         NaN         k3         J3

The code below generates the above sample dataframe:

import pandas as pd

df = pd.DataFrame({
    'idx': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
    'name': {0: 'name1', 1: 'name1', 2: 'name2', 3: 'name3', 4: 'name4',
             5: 'name4', 6: 'name4', 7: 'name5', 8: 'name6', 9: 'name6'},
    'types': {0: 'type1', 1: 'type2', 2: 'type1', 3: 'type1', 4: 'type1',
              5: 'type2', 6: 'type3', 7: 'type1', 8: 'type1', 9: 'type2'},
    'value': {0: 45, 1: 23, 2: 41, 3: 39, 4: 40, 5: 21, 6: 11, 7: 44, 8: 49, 9: 26},
    'value_date': {0: '2021-05-30', 1: '2021-05-30', 2: '2021-05-30',
                   3: '2021-05-30', 4: '2021-05-30', 5: '2021-05-30',
                   6: '2021-05-30', 7: '2021-05-30', 8: '2021-05-30', 9: '2021-05-30'},
    'desc': {0: 'name1-type1', 1: 'name1-type2', 2: 'name2-type1',
             3: 'name3-type1', 4: 'name4-type1', 5: 'name4-type2',
             6: 'name4-type3', 7: 'name5-type1', 8: 'name6-type1', 9: 'name6-type2'},
    'group': {0: 'G1', 1: 'G1', 2: 'G2', 3: 'G1', 4: 'G3', 5: 'G3', 6: 'G3',
              7: 'G1', 8: 'G2', 9: 'G2'},
    'owner': {0: 'O1', 1: 'O1', 2: 'O2', 3: 'O1', 4: 'O3', 5: 'O3', 6: 'O3',
              7: 'O1', 8: 'O2', 9: 'O2'},
    'key_name': {0: 'k1', 1: 'k3', 2: float('nan'), 3: 'k1', 4: 'k1', 5: 'k3',
                 6: float('nan'), 7: float('nan'), 8: float('nan'), 9: 'k1'},
    'key_value': {0: 'A1', 1: 'B3', 2: float('nan'), 3: 'D1', 4: 'E1', 5: 'F3',
                  6: float('nan'), 7: float('nan'), 8: float('nan'), 9: 'J1'},
    'key_name_1': {0: 'k2', 1: float('nan'), 2: 'k1', 3: 'k3', 4: 'k3', 5: 'k2',
                   6: float('nan'), 7: 'k1', 8: 'k2', 9: float('nan')},
    'key_value_1': {0: 'A2', 1: float('nan'), 2: 'C1', 3: 'D3', 4: 'E3', 5: 'F2',
                    6: float('nan'), 7: 'H1', 8: 'I2', 9: float('nan')},
    'key_name_2': {0: 'k3', 1: 'k2', 2: float('nan'), 3: float('nan'), 4: 'k2',
                   5: 'k1', 6: float('nan'), 7: float('nan'), 8: float('nan'), 9: 'k3'},
    'key_value_2': {0: 'A3', 1: 'B2', 2: float('nan'), 3: float('nan'), 4: 'E2',
                    5: 'F1', 6: float('nan'), 7: float('nan'), 8: float('nan'), 9: 'J3'},
})
# recreate the duplicate column names that the API delivers
df.rename(columns={'key_name_1': 'key_name', 'key_value_1': 'key_value',
                   'key_name_2': 'key_name', 'key_value_2': 'key_value'},
          inplace=True)

Notice the repeated "key_name", "key_value" column pairs after owner. These columns hold the metadata: key_name is the metadata's name, and key_value is its value. The number of metadata entries varies per entity, so some rows may have 15-20 such column pairs.

Other rows will have NaN if that metadata does not apply to that entity.

I want to reshape this dataframe into the one below, where each key_name becomes a column, and the values in that column are the key_value for each entity (wherever applicable):

idx    name   types   value value_date        desc   group   owner      k1      k2      k3
  1   name1   type1      45 2021-05-30 name1-type1      G1      O1      A1      A2      A3
  2   name1   type2      23 2021-05-30 name1-type2      G1      O1     NaN      B2      B3
  3   name2   type1      41 2021-05-30 name2-type1      G2      O2      C1     NaN     NaN
  4   name3   type1      39 2021-05-30 name3-type1      G1      O1      D1     NaN      D3
  5   name4   type1      40 2021-05-30 name4-type1      G3      O3      E1      E2      E3
  6   name4   type2      21 2021-05-30 name4-type2      G3      O3      F1      F2      F3
  7   name4   type3      11 2021-05-30 name4-type3      G3      O3     NaN     NaN     NaN
  8   name5   type1      44 2021-05-30 name5-type1      G1      O1      H1     NaN     NaN
  9   name6   type1      49 2021-05-30 name6-type1      G2      O2     NaN      I2     NaN
 10   name6   type2      26 2021-05-30 name6-type2      G2      O2      J1     NaN      J3

Now the problem is that these "key_name", "key_value" column pairs are not in the same order for every entity. For name1-type1, (k1, A1) appears first, then (k2, A2), and so on; but for name4-type2, (k3, F3) appears first, followed by k2 and then k1, the exact opposite order.

What I suspect is happening in the API is that the metadata for each unique entity is fetched in a loop and concatenated later, so the concatenation works but the order of the metadata is not guaranteed. I cannot fix the API as it's not mine, so I need to fix its output to continue further.
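For reference, the normalization being asked for can be sketched with plain pandas on a toy two-pair frame (all names below are illustrative, not the real data): once each pair is stacked into long form, the unreliable column order stops mattering, because each row then carries its own key name.

```python
import pandas as pd

# Toy two-pair frame with duplicated column labels (illustrative names):
# the pairs appear in a different order on each row, as in the question.
df = pd.DataFrame([['a', 'k1', 'A1', 'k2', 'A2'],
                   ['b', 'k2', 'B2', 'k1', 'B1']],
                  columns=['name', 'key_name', 'key_value',
                           'key_name', 'key_value'])

# Deduplicate the labels so each pair can be addressed individually.
pairs = df.set_index('name')
pairs.columns = [f'{c}_{i // 2}' for i, c in enumerate(pairs.columns)]

# Stack every (key_name_i, key_value_i) pair into long form; the
# row-level key names then make the original column order irrelevant.
long = pd.concat(
    [pairs[[f'key_name_{i}', f'key_value_{i}']]
       .set_axis(['key', 'value'], axis=1)
     for i in range(2)]
).dropna().reset_index()

wide = long.pivot(index='name', columns='key', values='value')
```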

Upvotes: 2

Views: 378

Answers (2)

sammywemmy

Reputation: 28644

A combination of helper tools from pyjanitor may help:

# pip install pyjanitor
import janitor
import pandas as pd
(df.pivot_longer(index = slice("idx", "owner"), 
                 names_to=("names", "values"), 
                 names_pattern=["name", "value"], 
                 values_to="valued")
  .dropna()
  .pivot_wider(index = slice("idx", "owner"), 
               names_from="names")
  .merge(df.loc[:, "idx":"owner"], 
         how = 'outer')
  .sort_values('idx')
  )

    idx   name  types  value  value_date         desc group owner   k1   k3   k2
0    1  name1  type1     45  2021-05-30  name1-type1    G1    O1   A1   A3   A2
1    2  name1  type2     23  2021-05-30  name1-type2    G1    O1  NaN   B3   B2
2    3  name2  type1     41  2021-05-30  name2-type1    G2    O2   C1  NaN  NaN
3    4  name3  type1     39  2021-05-30  name3-type1    G1    O1   D1   D3  NaN
4    5  name4  type1     40  2021-05-30  name4-type1    G3    O3   E1   E3   E2
5    6  name4  type2     21  2021-05-30  name4-type2    G3    O3   F1   F3   F2
9    7  name4  type3     11  2021-05-30  name4-type3    G3    O3  NaN  NaN  NaN
6    8  name5  type1     44  2021-05-30  name5-type1    G1    O1   H1  NaN  NaN
7    9  name6  type1     49  2021-05-30  name6-type1    G2    O2  NaN  NaN   I2
8   10  name6  type2     26  2021-05-30  name6-type2    G2    O2   J1   J3  NaN

pivot_longer just provides an easier way to reshape from wide to long. For your case there is a pattern (some column labels contain name, others value), and that pattern is passed to names_pattern to flip the table. pivot_wider is built on pandas pivot and is just syntactic sugar here; the only reason it is used is that I didn't want to type out all the index names :), but you can skip it and just use pivot. The outcome excludes idx 7, since it is completely null; merging back with the original df reintroduces it.

These are just helpers, so we can do without them. Here is a possible way to achieve the same with plain pandas:

Melt the columns, retaining the original index:

# the identifier columns: idx through owner
index = [*df.columns[:8]]

base = (df.melt(index, value_name="valued", ignore_index=False)
          .dropna()
          .drop(columns="variable"))

Extract the sub-frame where valued starts with "k" (the key names), and while at it, append the identifier columns to the existing index:

Ks = (base.loc[base.valued.str.startswith("k")]
          .set_index([*index], append = True)
      )

Extract the sub-frame where valued does not start with "k" (the key values):

non_Ks = (base.loc[~base.valued.str.startswith("k")]
              .rename(columns={"valued":"flip_this_column"})
              .set_index(index, append = True)
           )

Combine Ks with non_Ks, drop the irrelevant labels and reset the index:

(pd.concat([Ks, non_Ks], axis = 'columns')
   .set_index('valued',append=True)
   .unstack()
   .droplevel(0, axis = 'columns')
   .rename_axis(columns=None)
   .reset_index(index)
)



    idx   name  types  value  value_date         desc group owner   k1   k2   k3
0    1  name1  type1     45  2021-05-30  name1-type1    G1    O1   A1   A2   A3
1    2  name1  type2     23  2021-05-30  name1-type2    G1    O1  NaN   B2   B3
2    3  name2  type1     41  2021-05-30  name2-type1    G2    O2   C1  NaN  NaN
3    4  name3  type1     39  2021-05-30  name3-type1    G1    O1   D1  NaN   D3
4    5  name4  type1     40  2021-05-30  name4-type1    G3    O3   E1   E2   E3
5    6  name4  type2     21  2021-05-30  name4-type2    G3    O3   F1   F2   F3
7    8  name5  type1     44  2021-05-30  name5-type1    G1    O1   H1  NaN  NaN
8    9  name6  type1     49  2021-05-30  name6-type1    G2    O2  NaN   I2  NaN
9   10  name6  type2     26  2021-05-30  name6-type2    G2    O2   J1  NaN   J3
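As with the pyjanitor route above, idx 7 is absent from this output because all of its key columns are null and dropna removes it; an outer merge with the original identifier columns brings it back. A minimal sketch (df and out below are tiny stand-ins for the original frame and the reshaped result, not the real data):

```python
import pandas as pd

# Stand-ins: `df` holds the original identifier columns, `out` the
# reshaped result in which the all-NaN row (idx 7) has been dropped.
df = pd.DataFrame({'idx': [6, 7], 'name': ['name4', 'name4'],
                   'types': ['type2', 'type3']})
out = pd.DataFrame({'idx': [6], 'name': ['name4'], 'types': ['type2'],
                    'k1': ['F1']})

# An outer merge on the shared identifier columns reintroduces rows
# that were lost to dropna; their key columns come back as NaN.
restored = (df.merge(out, how='outer')
              .sort_values('idx', ignore_index=True))
```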

Upvotes: 1

Nk03

Reputation: 14949

STEPS:

  1. Filter the columns that start with key, along with idx.
  2. set_index('idx'), then reshape each row to pair every key with its value.
  3. explode the dataframe and use pd.Series to expand the keys and values into separate columns.
  4. Use pivot to restructure the dataframe.
  5. Finally, merge with the original dataframe.
# 1. keep 'idx' plus the duplicated key_name/key_value columns
df1 = df.filter(regex=r'idx|key')
# 2-3. pair each key with its value, explode, then expand into two columns
df2 = df1.set_index('idx').apply(lambda x: x.dropna(
).values.reshape(-1, 2), axis=1).explode().dropna().apply(pd.Series).reset_index()
# 4. unpack df2's columns ('idx', 0, 1) as (index, columns, values)
df2 = df2.pivot(*df2).reset_index()
# 5. merge back with the non-key columns of the original frame
df = df[[col for col in df.columns if not col.startswith('key')]].merge(
    df2, on='idx', how='left')

OUTPUT

   idx   name  types  value  value_date         desc group owner   k1   k2   k3
0    1  name1  type1     45  2021-05-30  name1-type1    G1    O1   A1   A2   A3
1    2  name1  type2     23  2021-05-30  name1-type2    G1    O1  NaN   B2   B3
2    3  name2  type1     41  2021-05-30  name2-type1    G2    O2   C1  NaN  NaN
3    4  name3  type1     39  2021-05-30  name3-type1    G1    O1   D1  NaN   D3
4    5  name4  type1     40  2021-05-30  name4-type1    G3    O3   E1   E2   E3
5    6  name4  type2     21  2021-05-30  name4-type2    G3    O3   F1   F2   F3
6    7  name4  type3     11  2021-05-30  name4-type3    G3    O3  NaN  NaN  NaN
7    8  name5  type1     44  2021-05-30  name5-type1    G1    O1   H1  NaN  NaN
8    9  name6  type1     49  2021-05-30  name6-type1    G2    O2  NaN   I2  NaN
9   10  name6  type2     26  2021-05-30  name6-type2    G2    O2   J1  NaN   J3
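A note on step 4: df2.pivot(*df2) relies on unpacking df2's column labels ('idx', 0, 1) as the positional (index, columns, values) arguments. Newer pandas releases (2.0+) make those arguments keyword-only, so spelling them out is the portable form; a small sketch with a hand-built stand-in for df2:

```python
import pandas as pd

# Hand-built stand-in for df2 after the explode/expand step:
# column 0 holds the key names, column 1 the key values.
df2 = pd.DataFrame({'idx': [1, 1, 2],
                    0: ['k1', 'k2', 'k3'],
                    1: ['A1', 'A2', 'B3']})

# Explicit, keyword-only equivalent of df2.pivot(*df2).reset_index()
wide = df2.pivot(index='idx', columns=0, values=1).reset_index()
```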

Upvotes: 1
