Reputation: 36176
Looking for ideas on how to solve the problem below:
I have a data frame where one of my columns contains a list of tuples like the following:
import pandas as pd

mydf = pd.DataFrame({
    'Field1' : ['A','B','C'],
    'Field2' : ['1','2','3'],
    'WeirdField' : [
        [ ('xxx', 'F1'), ('yyy','F2') ],
        [ ('asd', 'F3'), ('bla','F4') ],
        [ ('123', 'F2'), ('www','F5') ]
    ]
})
I would like each element in the second position of the tuple to become a column on the data frame, with the corresponding value taken from the first position. For the data frame above, this is what I am expecting:
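  Field1 Field2   F1   F2   F3   F4   F5
0      A      1  xxx  yyy  NaN  NaN  NaN
1      B      2  NaN  NaN  asd  bla  NaN
2      C      3  NaN  123  NaN  NaN  www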
The list can have several elements (not only 2 as in the example) and the number of elements can vary across the rows.
Can anyone suggest how to easily achieve this?
Thanks
Upvotes: 3
Views: 1206
Reputation: 107767
Consider a pivot_table solution after zipping through the column values. This will work for any number of tuples in WeirdField, assuming none of the F's repeat in the same row, since the pivot will take the max value:
import numpy as np

data = []

# APPEND TO LIST
for f1, f2, w in zip(mydf['Field1'].values, mydf['Field2'].values, mydf['WeirdField'].values):
    for i in w:
        data.append((f1, f2) + i)

# CAST LIST OF TUPLES TO DATAFRAME
df = pd.DataFrame(data, columns=['Field1', 'Field2', 'Value', 'Indicator'])

# PIVOT DATAFRAME
pvt = df.pivot_table(index=['Field1', 'Field2'], columns=['Indicator'],
                     values='Value', aggfunc='max', fill_value=np.nan).reset_index()
pvt.columns.name = None
# Field1 Field2 F1 F2 F3 F4 F5
# 0 A 1 xxx yyy NaN NaN NaN
# 1 B 2 NaN NaN asd bla NaN
# 2 C 3 NaN 123 NaN NaN www
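As a side note, the same wide result can also be reached without the intermediate pivot, by mapping each row's tuple list to a dict and letting pandas align the columns; a minimal sketch, assuming the same mydf as in the question:

# Build one dict per row ({indicator: value}) and expand it into columns;
# indicators missing from a row come back as NaN.
expanded = mydf['WeirdField'].apply(lambda tups: pd.Series({col: val for val, col in tups}))
result = pd.concat([mydf[['Field1', 'Field2']], expanded], axis=1)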
Upvotes: 1
Reputation: 4199
import pandas as pd

mydf = pd.DataFrame({
    'Field1' : ['A','B','C'],
    'Field2' : ['1','2','3'],
    'WeirdField' : [
        [ ('xxx', 'F1'), ('yyy','F2'), ('xyz','F6') ],
        [ ('asd', 'F3'), ('bla','F4') ],
        [ ('123', 'F2'), ('www','F5'), ('mno','F1') ]
    ]
})
print(mydf.head())

# Create a new data frame with just Field1 and Field2
newdf = pd.DataFrame({'Field1' : ['A','B','C'],
                      'Field2' : ['1','2','3'],
                      })

# Create a list of column names from the second element of each tuple
column_names = []
for index, row in mydf.iterrows():
    for j in range(len(mydf['WeirdField'][index])):
        column_names.append(mydf['WeirdField'][index][j][1])

# Create a unique set of column names
new_column_names = list(set(column_names))

# Add the new columns to the new dataframe and populate them with None
for i, j in enumerate(new_column_names):
    newdf.insert(i + 2, j, None)

# Now add the elements into the columns
for index, row in mydf.iterrows():
    for j in range(len(mydf['WeirdField'][index])):
        newdf.at[index, mydf['WeirdField'][index][j][1]] = mydf['WeirdField'][index][j][0]

print(newdf.head())
yields
Field1 Field2 F1 F2 F3 F4 F5 F6
0 A 1 xxx yyy None None None xyz
1 B 2 None None asd bla None None
2 C 3 mno 123 None None www None
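One caveat: new_column_names comes from a set, so the order of the F columns may differ between runs; using sorted(set(column_names)) instead gives a stable column order.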
Upvotes: 1
Reputation: 12515
First, I flattened the mydf['WeirdField'] column so we can look at just the values and column names, without worrying about the lists they're contained in. Next, you can use itertools.groupby to collect all of the corresponding values and row indices for each "F" column.
import itertools

# Must first sort the list by F column, or groupby won't work
flatter = sorted([list(x) + [idx] for idx, y in enumerate(mydf['WeirdField'])
                  for x in y], key=lambda x: x[1])

# Find all of the values that will eventually go in each F column
for key, group in itertools.groupby(flatter, lambda x: x[1]):
    list_of_vals = [(val, idx) for val, _, idx in group]
    # Add each value at the appropriate index and F column
    for val, idx in list_of_vals:
        mydf.loc[idx, key] = val
Yields this:
In [84]: mydf
Out[84]:
Field1 Field2 WeirdField F1 F2 F3 F4 F5
0 A 1 [(xxx, F1), (yyy, F2)] xxx yyy NaN NaN NaN
1 B 2 [(asd, F3), (bla, F4)] NaN NaN asd bla NaN
2 C 3 [(123, F2), (www, F5)] NaN 123 NaN NaN www
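Note that this approach adds the new columns to mydf in place, so the original WeirdField column is still present in the output above; if you want the exact layout from the question, you can drop it afterwards:

result = mydf.drop('WeirdField', axis=1)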
Upvotes: 4