Reputation: 1704
I am receiving a dataframe from an API that has data for some entities in the inital columns, and the later columns contain metadata for those entities. The data is proprietary so I cannot show it exactly, but the structure of dataframe is exactly like below:
idx name types value value_date desc group owner key_name key_value key_name key_value key_name key_value
1 name1 type1 45 2021-05-30 name1-type1 G1 O1 k1 A1 k2 A2 k3 A3
2 name1 type2 23 2021-05-30 name1-type2 G1 O1 k3 B3 NaN NaN k2 B2
3 name2 type1 41 2021-05-30 name2-type1 G2 O2 NaN NaN k1 C1 NaN NaN
4 name3 type1 39 2021-05-30 name3-type1 G1 O1 k1 D1 k3 D3 NaN NaN
5 name4 type1 40 2021-05-30 name4-type1 G3 O3 k1 E1 k3 E3 k2 E2
6 name4 type2 21 2021-05-30 name4-type2 G3 O3 k3 F3 k2 F2 k1 F1
7 name4 type3 11 2021-05-30 name4-type3 G3 O3 NaN NaN NaN NaN NaN NaN
8 name5 type1 44 2021-05-30 name5-type1 G1 O1 NaN NaN k1 H1 NaN NaN
9 name6 type1 49 2021-05-30 name6-type1 G2 O2 NaN NaN k2 I2 NaN NaN
10 name6 type2 26 2021-05-30 name6-type2 G2 O2 k1 J1 NaN NaN k3 J3
Below code will generate the above sample dataframe:
df = pd.DataFrame( {'idx': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}, 'name': {0: 'name1', 1: 'name1', 2: 'name2', 3: 'name3', 4: 'name4', 5: 'name4', 6: 'name4', 7: 'name5', 8: 'name6', 9: 'name6'}, 'types': {0: 'type1', 1: 'type2', 2: 'type1', 3: 'type1', 4: 'type1', 5: 'type2', 6: 'type3', 7: 'type1', 8: 'type1', 9: 'type2'}, 'value': {0: 45, 1: 23, 2: 41, 3: 39, 4: 40, 5: 21, 6: 11, 7: 44, 8: 49, 9: 26}, 'value_date': {0: '2021-05-30', 1: '2021-05-30', 2: '2021-05-30', 3: '2021-05-30', 4: '2021-05-30', 5: '2021-05-30', 6: '2021-05-30', 7: '2021-05-30', 8: '2021-05-30', 9: '2021-05-30'}, 'desc': {0: 'name1-type1', 1: 'name1-type2', 2: 'name2-type1', 3: 'name3-type1', 4: 'name4-type1', 5: 'name4-type2', 6: 'name4-type3', 7: 'name5-type1', 8: 'name6-type1', 9: 'name6-type2'}, 'group': {0: 'G1', 1: 'G1', 2: 'G2', 3: 'G1', 4: 'G3', 5: 'G3', 6: 'G3', 7: 'G1', 8: 'G2', 9: 'G2'}, 'owner': {0: 'O1', 1: 'O1', 2: 'O2', 3: 'O1', 4: 'O3', 5: 'O3', 6: 'O3', 7: 'O1', 8: 'O2', 9: 'O2'}, 'key_name': {0: 'k1', 1: 'k3', 2: float('nan'), 3: 'k1', 4: 'k1', 5: 'k3', 6: float('nan'), 7: float('nan'), 8: float('nan'), 9: 'k1'}, 'key_value': {0: 'A1', 1: 'B3', 2: float('nan'), 3: 'D1', 4: 'E1', 5: 'F3', 6: float('nan'), 7: float('nan'), 8: float('nan'), 9: 'J1'}, 'key_name_1': {0: 'k2', 1: float('nan'), 2: 'k1', 3: 'k3', 4: 'k3', 5: 'k2', 6: float('nan'), 7: 'k1', 8: 'k2', 9: float('nan')}, 'key_value_1': {0: 'A2', 1: float('nan'), 2: 'C1', 3: 'D3', 4: 'E3', 5: 'F2', 6: float('nan'), 7: 'H1', 8: 'I2', 9: float('nan')}, 'key_name_2': {0: 'k3', 1: 'k2', 2: float('nan'), 3: float('nan'), 4: 'k2', 5: 'k1', 6: float('nan'), 7: float('nan'), 8: float('nan'), 9: 'k3'}, 'key_value_2': {0: 'A3', 1: 'B2', 2: float('nan'), 3: float('nan'), 4: 'E2', 5: 'F1', 6: float('nan'), 7: float('nan'), 8: float('nan'), 9: 'J3'}} )
df.rename(columns={'key_name_1':'key_name','key_value_1':'key_value','key_name_2':'key_name','key_value_2':'key_value'}, inplace=True)
Notice the repeated "key_name", "key_value" pair columns after owner. These columns hold the metadata values, so key_name
is the metadata name, and key_value
is its value. Now there can be different number of metadatas for an entity, so for some rows, there maybe 15-20 such pair columns.
Other rows will have NaN
if that metadata does not apply for that entity.
I want to reshape this dataframe into the below one, where each key_name
becomes column, and the values in that column should be the key_value
for each entity (wherever applicable):
idx name types value value_date desc group owner k1 k2 k3
1 name1 type1 45 2021-05-30 name1-type1 G1 O1 A1 A2 A3
2 name1 type2 23 2021-05-30 name1-type2 G1 O1 NaN B2 B3
3 name2 type1 41 2021-05-30 name2-type1 G2 O2 C1 NaN NaN
4 name3 type1 39 2021-05-30 name3-type1 G1 O1 D1 NaN D3
5 name4 type1 40 2021-05-30 name4-type1 G3 O3 E1 E2 E3
6 name4 type2 21 2021-05-30 name4-type2 G3 O3 F1 F2 F3
7 name4 type3 11 2021-05-30 name4-type3 G3 O3 NaN NaN NaN
8 name5 type1 44 2021-05-30 name5-type1 G1 O1 H1 NaN NaN
9 name6 type1 49 2021-05-30 name6-type1 G2 O2 NaN I2 NaN
10 name6 type2 26 2021-05-30 name6-type2 G2 O2 J1 NaN J3
Now the problem is that these "key_name", "key_value" pair data are not in same order for each entity in the problem dataframe. If you see for name1-type1, (
k1
, A1)
appear first in the order, then (
k2
, A2)
, .... but for name4-type2, (
k3
, F3)
appear first followed by k2
and then k1
, totally opposite.
What I suspect is happening in the API is that for each unique entity, data is being fetched in loop and then concatenated later, so concatenation works but the order of metadata is not guaranteed then. I cannot fix the API as its not mine, so I need to fix its output to continue further.
Upvotes: 2
Views: 378
Reputation: 28644
A combination of helper tools from pyjanitor may help:
# pip install pyjanitor
import janitor
import pandas as pd
(df.pivot_longer(index = slice("idx", "owner"),
names_to=("names", "values"),
names_pattern=["name", "value"],
values_to="valued")
.dropna()
.pivot_wider(index = slice("idx", "owner"),
names_from="names")
.merge(df.loc[:, "idx":"owner"],
how = 'outer')
.sort_values('idx')
)
idx name types value value_date desc group owner k1 k3 k2
0 1 name1 type1 45 2021-05-30 name1-type1 G1 O1 A1 A3 A2
1 2 name1 type2 23 2021-05-30 name1-type2 G1 O1 NaN B3 B2
2 3 name2 type1 41 2021-05-30 name2-type1 G2 O2 C1 NaN NaN
3 4 name3 type1 39 2021-05-30 name3-type1 G1 O1 D1 D3 NaN
4 5 name4 type1 40 2021-05-30 name4-type1 G3 O3 E1 E3 E2
5 6 name4 type2 21 2021-05-30 name4-type2 G3 O3 F1 F3 F2
9 7 name4 type3 11 2021-05-30 name4-type3 G3 O3 NaN NaN NaN
6 8 name5 type1 44 2021-05-30 name5-type1 G1 O1 H1 NaN NaN
7 9 name6 type1 49 2021-05-30 name6-type1 G2 O2 NaN NaN I2
8 10 name6 type2 26 2021-05-30 name6-type2 G2 O2 J1 J3 NaN
pivot_longer just provides an easier way to reshape from wide to long - for your case, there is a pattern (some columns have name
, some .value
) - we use that pattern in names_pattern
to flip the table. pivot_wider is built on pandas pivot
and is just syntactic sugar in this case - the only reason used here is just because I didnt want to type out all the index names :), but you can skip it and just use pivot
. The outcome excludes idx
7, since it is completely null ... merging back with the original df
reintroduces it.
They are just helpers, so we can afford not to include them. Here is a possible way to achieve this:
Melt the columns, retaining the original index:
index = [*df.columns[:8]]
base = (df.melt(index, value_name = "valued", ignore_index = False)
.dropna()
.drop(columns="variable"))
extract dataframe where valued
only starts with K
, and while at it, append index
to existing index:
Ks = (base.loc[base.valued.str.startswith("k")]
.set_index([*index], append = True)
)
extract dataframe where valued
does not start with K
:
non_Ks = (base.loc[~base.valued.str.startswith("k")]
.rename(columns={"valued":"flip_this_column"})
.set_index(index, append = True)
)
Combined Ks
with non_Ks
, drop the irrelevant labels and reset index :
(pd.concat([Ks, non_Ks], axis = 'columns')
.set_index('valued',append=True)
.unstack()
.droplevel(0, axis = 'columns')
.rename_axis(columns=None)
.reset_index(index)
)
idx name types value value_date desc group owner k1 k2 k3
0 1 name1 type1 45 2021-05-30 name1-type1 G1 O1 A1 A2 A3
1 2 name1 type2 23 2021-05-30 name1-type2 G1 O1 NaN B2 B3
2 3 name2 type1 41 2021-05-30 name2-type1 G2 O2 C1 NaN NaN
3 4 name3 type1 39 2021-05-30 name3-type1 G1 O1 D1 NaN D3
4 5 name4 type1 40 2021-05-30 name4-type1 G3 O3 E1 E2 E3
5 6 name4 type2 21 2021-05-30 name4-type2 G3 O3 F1 F2 F3
7 8 name5 type1 44 2021-05-30 name5-type1 G1 O1 H1 NaN NaN
8 9 name6 type1 49 2021-05-30 name6-type1 G2 O2 NaN I2 NaN
9 10 name6 type2 26 2021-05-30 name6-type2 G2 O2 J1 NaN J3
Upvotes: 1
Reputation: 14949
key
along with idx
.set_index('idx')
use the reshape
to combine key and values
together.pd.Series
to expand key
and value
in separate columns.pivot
to restructure the dataframe.df1 = df.filter(regex=r'idx|key*')
df2 = df1.set_index('idx').apply(lambda x: x.dropna(
).values.reshape(-1, 2), axis=1).explode().dropna().apply(pd.Series).reset_index()
df2 = df2.pivot(*df2).reset_index()
df = df[[col for col in df.columns if not col.startswith('key')]].merge(
df2, on='idx', how='left')
idx name types value value_date desc group owner k1 k2 \
0 1 name1 type1 45 2021-05-30 name1-type1 G1 O1 A1 A2
1 2 name1 type2 23 2021-05-30 name1-type2 G1 O1 NaN B2
2 3 name2 type1 41 2021-05-30 name2-type1 G2 O2 C1 NaN
3 4 name3 type1 39 2021-05-30 name3-type1 G1 O1 D1 NaN
4 5 name4 type1 40 2021-05-30 name4-type1 G3 O3 E1 E2
5 6 name4 type2 21 2021-05-30 name4-type2 G3 O3 F1 F2
6 7 name4 type3 11 2021-05-30 name4-type3 G3 O3 NaN NaN
7 8 name5 type1 44 2021-05-30 name5-type1 G1 O1 H1 NaN
8 9 name6 type1 49 2021-05-30 name6-type1 G2 O2 NaN I2
9 10 name6 type2 26 2021-05-30 name6-type2 G2 O2 J1 NaN
k3
0 A3
1 B3
2 NaN
3 D3
4 E3
5 F3
6 NaN
7 NaN
8 NaN
9 J3
Upvotes: 1