Numbaker
Numbaker

Reputation: 81

Extracting useful info from pandas column containing dict

Category    Data     Age
A1            30     {'Age1_Set': 25.6, 'WIndex': 343.3, 'Age2_Set': 22.6}
A2            20     {'Age1_Set': 35.2, 'WIndex': 343.3, 'Age2_Set': 42.1}
A3            20     {'Age1_Set': 26.5, 'WIndex': 343.3, 'Age2_Set': 11.1}

My pandas df is like above. Age should be the min of 'Age1_Set' and 'Age2_Set'. I am able to deal with this if Age was not a dict and rather three columns.

My output is:

Category    Data     Age
A1            30     22.6
A2            20     35.2
A3            20     11.1

How can this be done?

EDIT: I have another entry in original df

Category    Data     Age
A4            20    

Age is null basically. How to deal with such case.

Upvotes: 2

Views: 60

Answers (4)

Ivo Merchiers
Ivo Merchiers

Reputation: 1688

If you only need to compare these two values of the dict, you can use the following method:

df['Age']=df['Age'].apply(lambda x: min(x["Age1_Set"], x["Age2_Set"]))

This avoids the hassle of converting the dictionary to other data formats and is IMO easily readable.

If you need to more complex logic, then you might want to extract the lambda into its own function. For example, when you need to deal with entries that are None:

def get_min_age(entry:dict)->float:
    if entry is None:
        return 0
    else:
        return min(entry["age1"], entry["age2"])
df['Age']=df['Age'].apply(get_min_age)

A note on performance:

Using .apply() is not an ideal approach when dealing with large datasets (>100k rows), since it treats every row in a sequential way.

If you notice that performance becomes a bottleneck, the vectorized methods in the other answers will be more performant (as this article explains in more detail).

Upvotes: 2

René
René

Reputation: 4827

You can try:

df = pd.DataFrame({"Category": ["A1", "A2", "A3"],
                   "Data": [30, 20, 20],
                   "Age": [{'Age1_Set': 25.6, 'WIndex': 343.3, 'Age2_Set': 22.6}, {'Age1_Set': 35.2, 'WIndex': 343.3, 'Age2_Set': 42.1}, {'Age1_Set': 26.5, 'WIndex': 343.3, 'Age2_Set': 11.1}]})

df["Min_Age"] = [min(x["Age1_Set"], x["Age2_Set"]) for x in df.Age]

print(df)

Result:

  Category  Data                                                Age  Min_Age
0       A1    30  {'Age1_Set': 25.6, 'WIndex': 343.3, 'Age2_Set'...     22.6
1       A2    20  {'Age1_Set': 35.2, 'WIndex': 343.3, 'Age2_Set'...     35.2
2       A3    20  {'Age1_Set': 26.5, 'WIndex': 343.3, 'Age2_Set'...     11.1

Upvotes: 1

Shubham Sharma
Shubham Sharma

Reputation: 71689

Create a new dataframe from the dictionaries in the Age column, then filter the Age like columns and use min along axis=1 to get the minimum of age from Age1_Set and Age2_Set:

df['Age'] = pd.DataFrame(df['Age'].tolist()).filter(like='Age').min(1)

Alternatively you can also use Series.str.get + np.minimum:

df['Age'] = np.minimum(df['Age'].str.get('Age1_Set'), df['Age'].str.get('Age2_Set'))

  Category  Data   Age
0       A1    30  22.6
1       A2    20  35.2
2       A3    20  11.1

Upvotes: 3

Jimmys
Jimmys

Reputation: 377

You could just pass your data column without any extra params.

import pandas as pd

df2 = pd.json_normalize(df['Age'])

Upvotes: 0

Related Questions