Normalize Pandas DataFrame like Databases Tables

Question

I'd like split a dataframe, which include 2 types of data (brand owners and products) and into 2 dataframes.

Original dataframe:

>>> products
        product_id     brand_owner     product_name
0       344606         Cargill         A
1       344607         Red Gold        B
2       344608         FooBar          C
3       344609         Red Gold        D
4       344610         Cargill         E

I' like to extract brand_owner into another dataframe, much like normalizing a database:

>>> brand_owners = pd.DataFrame(branded_foods['brand_owner'].unique())
>>> brand_owners
                        0
0                 Cargill
1      Kellogg Company Us
2                Kashi Us
3                Red Gold
4          Conagra Brands
...                   ...

I gave its rows an ID (again, as a database primary key)

>>> brand_owners.index += 1
>>> brand_owners['id'] = brand_owners.index
>>> brand_owners
                        0     id
1                 Cargill      1
2      Kellogg Company Us      2
3                Kashi Us      3
4                Red Gold      4
5          Conagra Brands      5
...                   ...    ...

[25202 rows x 2 columns]
>>> brand_owners.columns = ['name', 'id']
>>> brand_owners
                     name     id
1                 Cargill      1
2      Kellogg Company Us      2
3                Kashi Us      3
4                Red Gold      4
5          Conagra Brands      5
...                   ...    ...

Now I want to get this IDs back into the original dataframe, so it will look like:

        product_id     brand_owner     product_name
0       344606         1               A
1       344607         4               B
2       344608         45              C
3       344609         4               D
4       344610         1               E

How can I make this update in Pandas: update products p set p.brand_owner = (select id from brand_owners b where b.name = p.brand_owner)

yatu · Accepted Answer

You can directly encode the categories in brand_owner with pd.factorize:

df['brand_owner'] = pd.factorize(df.brand_owner)[0]

print(df)

   product_id  brand_owner product_name
0      344606            0            A
1      344607            1            B
2      344608            2            C
3      344609            1            D
4      344610            0            E

Normalize Pandas DataFrame like Databases Tables

Answers (1)

Related Questions