Diego
Diego

Reputation: 36176

List of tuples to data frame

looking for ideas on how to solve the problem bellow:

I have a data frame where one of my columns contains a list of tuples like the following:

mydf = pd.DataFrame({ 
        'Field1' : ['A','B','C'],
        'Field2' : ['1','2','3'],
        'WeirdField' :[ 
                      [ ('xxx', 'F1'), ('yyy','F2') ],
                      [ ('asd', 'F3'), ('bla','F4') ],
                      [ ('123', 'F2'), ('www','F5') ]
                      ]
        })

enter image description here

I would like each element on the second position of the tuple to become a column on the data frame with the corespondent value on the first position. For the data frame above, this is what I am expecting:

enter image description here

the list can have several elements (not only 2 as the example) and the number of elements can vary across the rows.

can anyone suggest how to easily achieve this?

Thanks

Upvotes: 3

Views: 1206

Answers (3)

Parfait
Parfait

Reputation: 107767

Consider a pivot_table solution after zipping through column values. This will work in any number of tuples in WeirdField assuming none of the F's repeat in same row as pivot will take max value:

data =[]
# APPEND TO LIST
for f1,f2,w in zip(mydf['Field1'].values, mydf['Field2'].values, mydf['WeirdField'].values):
    for i in w:
        data.append((f1, f2) + i)
# CAST LIST OF TUPLES TO DATAFRAME
df = pd.DataFrame(data, columns=['Field1', 'Field2', 'Value', 'Indicator'])

# PIVOT DATAFRAME
pvt = df.pivot_table(index=['Field1', 'Field2'], columns=['Indicator'],
                     values='Value', aggfunc='max', fill_value=np.nan).reset_index()
pvt.columns.name = None

#   Field1 Field2   F1   F2   F3   F4   F5
# 0      A      1  xxx  yyy  NaN  NaN  NaN
# 1      B      2  NaN  NaN  asd  bla  NaN
# 2      C      3  NaN  123  NaN  NaN  www

Upvotes: 1

plasmon360
plasmon360

Reputation: 4199

import pandas as pd

mydf = pd.DataFrame({ 
        'Field1' : ['A','B','C'],
        'Field2' : ['1','2','3'],
        'WeirdField' :[ 
                      [ ('xxx', 'F1'), ('yyy','F2'),('xyz','F6') ],
                      [ ('asd', 'F3'), ('bla','F4') ],
                      [ ('123', 'F2'), ('www','F5') ,('mno','F1') ]
                      ]
        })

print mydf.head()

# Create a new data frame with just field1 and field2

newdf = pd.DataFrame({'Field1' : ['A','B','C'],
        'Field2' : ['1','2','3'],
        })
# create a list of columns
column_names = []
for index, row in mydf.iterrows():
    for j in range( len(mydf['WeirdField'][index])):
        column_names.append( mydf['WeirdField'][index][j][1])

# Create a unique set of columns names
new_column_names = list(set(column_names))

# Add list of columns to the new dataframe and populate with None
for i,j in enumerate(new_column_names):
    newdf.insert(i+2,j,None)

# now add the elements into the columns
for index, row in mydf.iterrows():
    for j in range( len(mydf['WeirdField'][index])):
        newdf.set_value(index, [mydf['WeirdField'][index][j][1]], mydf['WeirdField'][index][j][0])

print newdf.head()

yields

  Field1 Field2    F1    F2    F3    F4    F5    F6
0      A      1   xxx   yyy  None  None  None   xyz
1      B      2  None  None   asd   bla  None  None
2      C      3   mno   123  None  None   www  None

Upvotes: 1

boot-scootin
boot-scootin

Reputation: 12515

First, I flattened the mydf['WeirdField'] column so we could see just the values and column names, and don't have to worry about the lists in which they're contained. Next, you can use itertools.groupby to get all of the corresponding values and indices for each "F" column.

import itertools

# Must first sort the list by F column, or groupby won't work                  
flatter = sorted([list(x) + [idx] for idx, y in enumerate(mydf['WeirdField']) 
                  for x in y], key = lambda x: x[1]) 

# Find all of the values that will eventually go in each F column                
for key, group in itertools.groupby(flatter, lambda x: x[1]):
    list_of_vals = [(val, idx) for val, _, idx in group]

    # Add each value at the appropriate index and F column
    for val, idx in list_of_vals:
        mydf.loc[idx, key] = val

Yields this:

In [84]: mydf
Out[84]: 
  Field1 Field2              WeirdField   F1   F2   F3   F4   F5
0      A      1  [(xxx, F1), (yyy, F2)]  xxx  yyy  NaN  NaN  NaN
1      B      2  [(asd, F3), (bla, F4)]  NaN  NaN  asd  bla  NaN
2      C      3  [(123, F2), (www, F5)]  NaN  123  NaN  NaN  www

Upvotes: 4

Related Questions