Reputation: 628
I'd like split a dataframe, which include 2 types of data (brand owners and products) and into 2 dataframes.
Original dataframe:
>>> products
product_id brand_owner product_name
0 344606 Cargill A
1 344607 Red Gold B
2 344608 FooBar C
3 344609 Red Gold D
4 344610 Cargill E
I' like to extract brand_owner into another dataframe, much like normalizing a database:
>>> brand_owners = pd.DataFrame(branded_foods['brand_owner'].unique())
>>> brand_owners
0
0 Cargill
1 Kellogg Company Us
2 Kashi Us
3 Red Gold
4 Conagra Brands
... ...
I gave its rows an ID (again, as a database primary key)
>>> brand_owners.index += 1
>>> brand_owners['id'] = brand_owners.index
>>> brand_owners
0 id
1 Cargill 1
2 Kellogg Company Us 2
3 Kashi Us 3
4 Red Gold 4
5 Conagra Brands 5
... ... ...
[25202 rows x 2 columns]
>>> brand_owners.columns = ['name', 'id']
>>> brand_owners
name id
1 Cargill 1
2 Kellogg Company Us 2
3 Kashi Us 3
4 Red Gold 4
5 Conagra Brands 5
... ... ...
Now I want to get this IDs back into the original dataframe, so it will look like:
product_id brand_owner product_name
0 344606 1 A
1 344607 4 B
2 344608 45 C
3 344609 4 D
4 344610 1 E
How can I make this update in Pandas: update products p set p.brand_owner = (select id from brand_owners b where b.name = p.brand_owner)
Upvotes: 2
Views: 490
Reputation: 88236
You can directly encode the categories in brand_owner
with pd.factorize
:
df['brand_owner'] = pd.factorize(df.brand_owner)[0]
print(df)
product_id brand_owner product_name
0 344606 0 A
1 344607 1 B
2 344608 2 C
3 344609 1 D
4 344610 0 E
Upvotes: 3