Reputation: 25
I'm new to pandas.
Consider you have a state in which you have a pandas Dataframe
structure of columns like below:
user_id | timestamp | foo_name1 | foo_name2 | foo_name3
As we can see Dataframe
has several metadata parameters, having raw string values:
user_id, timestamp
and several dynamic name columns - which have a string value of json within each: foo_name1..foo_name3
Example for the structure of json within foo_name1
col (which has a fixed hierarchy, dict keys and values may vary within):
{"foo_att1": "foo_value1","foo_att2": "foo_value2"}
So my will is to end up with this kind of DF structure instead - kind of expending:
DF:
user_id | timestamp | foo_name1-foo_att1 | foo_name1-foo_att2 | foo_name2-foo_att1 | foo_name2-foo_att2
Whereasfoo_name1-foo_att1
will have as value: "foo_value1"
foo_name1-foo_att2
will have value : "foo_value2"
Etc...
How can I achieve this using pandas actions?
Upvotes: 0
Views: 95
Reputation: 31166
pd.concat(axis=1)
and pd.json_normalize()
gets you to your answerdict
comprehension to name columns as per your requirementdf = pd.DataFrame([{**{"userid": random.randint(1,100),
"timestamp":dt.datetime(2020,8,9,random.randint(0,23))},
**{f"foo_name{f+1}":{f"foo_att{ff+1}":random.randint(1,10) for ff in range(2)} for f in range(3)}
} for r in range(5)])
df2 = pd.concat([
df.loc[:,["userid", "timestamp"]]] + # the fixed columns
[
pd.DataFrame({f"{c}.{k}":v # rename columns as per requirement
for k,v in
# json_normalize is doing all the work, but requirement to rename columns....
pd.json_normalize(df[c]).to_dict(orient="list").items()})
# in dict comprehension loop through all foo columns to turn them from dicts to columns
for c in df.columns if "foo" in c
], axis=1)
print(f"{df.to_string(index=False)}\n\n{df2.to_string(index=False)}")
output
userid timestamp foo_name1 foo_name2 foo_name3
94 2020-08-09 15:00:00 {'foo_att1': 1, 'foo_att2': 7} {'foo_att1': 5, 'foo_att2': 2} {'foo_att1': 5, 'foo_att2': 9}
61 2020-08-09 05:00:00 {'foo_att1': 2, 'foo_att2': 3} {'foo_att1': 10, 'foo_att2': 9} {'foo_att1': 10, 'foo_att2': 4}
69 2020-08-09 12:00:00 {'foo_att1': 9, 'foo_att2': 6} {'foo_att1': 7, 'foo_att2': 8} {'foo_att1': 5, 'foo_att2': 9}
16 2020-08-09 13:00:00 {'foo_att1': 2, 'foo_att2': 1} {'foo_att1': 7, 'foo_att2': 8} {'foo_att1': 9, 'foo_att2': 1}
51 2020-08-09 11:00:00 {'foo_att1': 3, 'foo_att2': 1} {'foo_att1': 7, 'foo_att2': 1} {'foo_att1': 6, 'foo_att2': 3}
userid timestamp foo_name1.foo_att1 foo_name1.foo_att2 foo_name2.foo_att1 foo_name2.foo_att2 foo_name3.foo_att1 foo_name3.foo_att2
94 2020-08-09 15:00:00 1 7 5 2 5 9
61 2020-08-09 05:00:00 2 3 10 9 10 4
69 2020-08-09 12:00:00 9 6 7 8 5 9
16 2020-08-09 13:00:00 2 1 7 8 9 1
51 2020-08-09 11:00:00 3 1 7 1 6 3
Better approach to picking columns to be normalised.
df = pd.DataFrame([{**{"userid": random.randint(1,100),
"timestamp":dt.datetime(2020,8,9,random.randint(0,23))},
**{f"foo_name{f+1}":{f"foo_att{ff+1}":random.randint(1,10) for ff in range(2)} for f in range(3)},
**{"foo_namex":np.nan},
**{"foo_namey":"hello"}
} for r in range(5)])
# which columns do we want to convert?
convert = [c[0]
for c in df.dtypes.items()
if "foo" in c[0]
and c[1].name=="object"
and isinstance(df.loc[0,c[0]], dict)]
df2 = pd.concat([
df.loc[:,[c for c in df.columns if c not in convert]]] + # keep the columns not being converted
[
pd.DataFrame({f"{c}.{k}":v # rename columns as per requirement
for k,v in
# json_normalize is doing all the work, but requirement to rename columns....
pd.json_normalize(df[c]).to_dict(orient="list").items()})
# already worked out list of columns to convert
for c in convert
], axis=1)
Upvotes: 1