Reputation: 117
I have a pandas dataframe that has a list of values inside a cell. I need to convert these values into columns containing true or false if the column value is inside the list for that row. I need a column for every unique value inside every row's list.
This is my dataframe:
data = [
{"agency_id": 1,"province": ["CH", "PE"]},
{"agency_id": 3,"province": ["CH", "CS"]}
]
df = pd.DataFrame(data)
agency_id province
0 1 [CH, PE]
1 3 [CH, CS]
To create the intial dataframe.
Then I tried:
df2 = pd.DataFrame(df['province'].values.tolist(),index=df['agency_id'])
and it outputs this:
0 1 2 3 4 5 6 7
agency_id
1 CH PE AQ TE None None None None
3 KR CS None None None None None None
7 FE FC BO MO RA RE RN PR
8 None None None None None None None None
10 RM None None None None None None None
11 RM None None None None None None None
But it's not what I want because the columns are not "aligned".
I need something like this:
agency_id CH PE CS
1 true true false
3 true false true
Upvotes: 3
Views: 140
Reputation: 61910
Another solution, just using pandas
:
import pandas as pd
data = [
{"agency_id": 1,"province": ["CH", "PE"]},
{"agency_id": 3,"province": ["CH", "CS"]}
]
df = pd.DataFrame(data)
result = df['province'].apply(lambda x: '|'.join(x)).str.get_dummies().astype(bool).set_index(df.agency_id)
print(result)
Output
CH CS PE
agency_id
1 True False True
3 True True False
Upvotes: 1
Reputation: 51673
You can clean up / modify your data
if you do not like to import from sklearn.preprocessing import MultiLabelBinarizer
for this:
import pandas as pd
data = [
{"agency_id": 1,"province": ["CH", "PE"]},
{"agency_id": 3,"province": ["CH", "CS"]}
]
# get all provinces from any included dictionaries of data:
all_prov = sorted(set( (x for y in [d["province"] for d in data] for x in y) ))
# add the missing key:values to your data's dicts:
for d in data:
for p in all_prov:
d[p] = p in d["province"]
print(data)
df = pd.DataFrame(data)
print(df)
Output:
# data
[{'agency_id': 1, 'province': ['CH', 'PE'], 'CH': True, 'CS': False, 'PE': True},
{'agency_id': 3, 'province': ['CH', 'CS'], 'CH': True, 'CS': True, 'PE': False}]
# df
CH CS PE agency_id province
0 True False True 1 [CH, PE]
1 True True False 3 [CH, CS]
Upvotes: 2
Reputation: 323306
From sklearn
MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df['province']),columns=mlb.classes_, index=df.agency_id).astype(bool)
Out[90]:
CH CS PE
agency_id
1 True False True
3 True True False
Upvotes: 3