Reputation: 3452
I have a Dataframe df as follows
df = pd.DataFrame(
[['tom', 28, "tesla", "model s" ],
['nick', 30, "ford", "mustang"],
['juli', 56, "tesla", "model s"]] ,
columns = ['PersonName', 'PersonAge', "Car", "CarModel"]
)
df.head()
PersonName PersonAge Car CarModel
0 tom 28 tesla model s
1 nick 30 ford mustang
2 juli 56 tesla model s
I would like to transform it into two dataframe as follows
Person=
PersonName PersonAge CarId
tom 28 1
nick 30 2
juli 56 1
Cars=
id Car Model
1 Tesla Model S
2 Ford Mustang
What would be the correct way to do this?
Upvotes: 0
Views: 430
Reputation: 1247
The following gives you what you want.
Use groupby to create Cars dataframe
CarsDF = pd.DataFrame(df.groupby(['Car','CarModel']).Car.count())
CarsDF.columns=['v']
CarsDF = CarsDF.reset_index().drop('v', axis=1)
CarsDF.index.name='CarId'
CarsDF
Then use merge to add CarId
PersonDF = df.merge(CarsDF.reset_index())[['PersonName','PersonAge','CarId']]
Upvotes: 1
Reputation: 77347
You can assign columns to new dataframes to create Person
and Cars
.
>>> df = pd.DataFrame(
... [['tom', 28, "tesla", "model s" ],
... ['nick', 30, "ford", "mustang"],
... ['juli', 56, "tesla", "model s"]] ,
... columns = ['PersonName', 'PersonAge', "Car", "CarModel"]
... )
>>> df
PersonName PersonAge Car CarModel
0 tom 28 tesla model s
1 nick 30 ford mustang
2 juli 56 tesla model s
>>>
>>> Person = df[["PersonName", "PersonAge"]]
>>> Cars = df[["Car", "CarModel"]]
>>> Person
PersonName PersonAge
0 tom 28
1 nick 30
2 juli 56
>>> Cars
Car CarModel
0 tesla model s
1 ford mustang
2 tesla model s
Notice that your original dataframe had an index and that index is preserved when creating the new ones. That index can be used as your foreign key. In fact, the indexing part is frequently transparent in pandas. To select a car from the person name, you could
>>> Cars[Person["PersonName"] == "nick"]
Car CarModel
1 ford mustang
If you want to turn the index into its own column you can do so
>>> Cars["id"] = Cars.index
>>> Person["CarId"] = Person.index
>>> Person
PersonName PersonAge CarId
0 tom 28 0
1 nick 30 1
2 juli 56 2
>>> Cars
Car CarModel id
0 tesla model s 0
1 ford mustang 1
2 tesla model s 2
But sticking with operations on the index is generally more convenient.... unless you reset the index, then, you have to go with a different column
Upvotes: 2
Reputation: 1669
This should work:
import pandas as pd
df = pd.DataFrame(
[['tom', 28, "tesla", "model s" ],
['nick', 30, "ford", "mustang"],
['juli', 56, "tesla", "model s"]] ,
columns = ['PersonName', 'PersonAge', "Car", "CarModel"]
)
df['Comb'] = df['Car'] + "," + df['CarModel']
car_df = pd.DataFrame({'CarId': range(1, len(df['Comb'].unique())+1), 'Comb': df['Comb'].unique()})
car_df = car_df.merge(df[['Comb', 'Car', 'CarModel']], on=['Comb']).drop_duplicates(subset=['Comb']).drop(columns=['Comb'])
df = df.merge(car_df, on=['Car', 'CarModel']).drop(columns=['Comb', 'Car', 'CarModel'])
car_df = car_df.set_index('CarId')
print(df)
print()
print(car_df)
Output:
PersonName PersonAge CarId
0 tom 28 1
1 juli 56 1
2 nick 30 2
Car CarModel
CarId
1 tesla model s
2 ford mustang
Upvotes: 1