Reputation: 3
I have a dataframe that is in the following format:
geo_locations feature mau_audience
0 Aabenraa Alcholic Drinks,Android users,Architecture,Art... 3.380211241711606,3.230448921378274,3.0,3.5910...
1 Aalborg Alcholic Drinks,Android users,Architecture,Art... 4.113943352306837,3.8920946026904804,3.7853298...
2 Assens Alcholic Drinks,Android users,Architecture,Art... 3.041392685158225,3.0,3.0,3.255272505103306,3....
3 Billund Alcholic Drinks,Android users,Architecture,Art... 3.0,3.0,3.0,3.1760912590556813,3.1461280356782...
4 Bornholm Alcholic Drinks,Android users,Architecture,Art... 3.0,3.0,3.0,3.113943352306837,3.04139268515822...
5 Esbjerg Alcholic Drinks,Android users,Architecture,Art... 3.792391689498254,3.5563025007672873,3.5314789...
It is very annoying to work with this format, as I have to look at the indexes of the cells to get the mau_audience value for a feature in a given location. Instead of this format, it would be a lot easier to work with it in the following format:
geo_locations Alcoholic Drinks Android Users Architecture
0 Aabenraa 3.380211241711606 3.230448921378274. 3.0
1 Aalborg 4.113943352306837. 3.8920946026904804 3.7853298
2 Assens 3.041392685158225 3.0. 3.0
3 Billund 3.0 3.0, 3.0
4 Bornholm 3.0 3.0 3.0
5 Esbjerg 3.792391689498254 3.5563025007672873 3.5314789
How would I go about reformating this dataframe? Thanks!
Upvotes: 0
Views: 48
Reputation: 1511
Assuming that every comma separated value in feature has a corresponding value in mau_audience.
data = pd.DataFrame([
['Aabenraa','Alcholic Drinks,Android users,Architecture', '3.380211241711606,3.230448921378274,3.0'],
['Aalborg','Alcholic Drinks,Android users,Architecture','4.113943352306837,3.8920946026904804,3.7853298']],
columns = ['geo_locations','feature','mau_audience'])
columns = data.loc[0,'feature'].split(',')
Assumed from first record for feature.
new = data['mau_audience'].str.split(',', expand=True)
new.columns = columns
new = new.astype('float')
data = data[['geo_locations']].merge(new,left_index=True, right_index=True)
Keeping only geo_locations from the original DataFrame.
geo_locations Alcholic Drinks Android users Architecture
0 Aabenraa 3.380211 3.230449 3.00000
1 Aalborg 4.113943 3.892095 3.78533
Upvotes: 2
Reputation: 641
import pandas as pd
df=pd.DataFrame({"A":["a,b,c","a,b,d"],"B":["1,2,3","3,4,5"],"C":['a','b']})
df.A=df.A.apply(lambda x : x.split(','))
df.B=df.B.apply(lambda x : x.split(','))
final_df=df[['C']]
temp_df = df.apply(lambda row: dict(zip(row["A"], row["B"])), axis=1)
temp_df would be of this format
0 {'a': '1', 'b': '2', 'c': '3'}
1 {'a': '3', 'b': '4', 'd': '5'}
temp_df_1=temp_df.apply(pd.Series)
final_df=pd.concat([final_df, temp_df_1], axis=1)
final_df
C a b c d
0 a 1 2 3 NaN
1 b 3 4 NaN 5
Upvotes: 1