Tom
Tom

Reputation: 131

How to 'unstack' all lists in columns of dataframe and combine results pairwise?

I have a dataframe with 5 columns, each column containing lists of variable lengths. This is what a row in my dataframe looks like:

               A                                         B
1 [aircrafts, they, agreement, airplane]  [are, built, made, built] 

Now I would like to 'unpack' or 'unstack' these lists so that each cell only contains one value (one word). In the unpacking process, the words in cells from one column should be combined pairwise with the corresponding value in the next column. The result would then be:

        A            B
   1  aircrafts      are 
   2  they           built
   3  agreement      made
   4  airplane       built

For reference, my full dataframe looks as follows:

   obj                       rel1  \
0                                 [Boeing]                    [sells]   
1   [aircrafts, they, agreement, airplane]  [are, built, made, built]   
2                       [exception, these]                 [are, are]   
3                        [sales, contract]       [regulated, consist]   
4                               [contract]               [stipulates]   
5                            [acquisition]                      [has]   
6                               [contract]                  [managed]   
7                               [employee]                      [act]   
8                    [salesperson, Boeing]          [change, ensures]   
9                     [airlines, airlines]            [related, have]   
10                                [Boeing]                     [keep]   

                  dep1                    rel2  \
0          [aircrafts]                    [to]   
1                   []              [on, with]   
2                   []           [of, of, for]   
3                   []  [by, of, with, of, of]   
4           [elements]            [across, as]   
5            [details]       [of, as, of, for]   
6                   []                    [by]   
7                   []               [as, for]   
8                   []  [Given, of, over, for]   
9            [company]               [to, for]   
10  [track, aircrafts]   [of, between, to, of]   

                                                 dep2  
0                                         [companies]  
1                                  [demand, customer]  
2                            [airplanes, scope, case]  
3   [means, contracts, companies, acquisitions, ai...  
4                          [acquisitions, conditions]  
5                [airplane, model, airplane, options]  
6                                       [salesperson]  
7                            [salesperson, contracts]  
8                     [term, contracts, time, client]  
9                                    [other, example]  
10        [relationships, companies, airlines, buyer]        

How can I perform the 'unpacking' and rearranging operations in python? It would be great if these operations could be performed on the dataframe itself. If this proves to be difficult or impossible, is there a way I could rearrange the data in the lists before combining them into a dataframe?

Thank you very much for any help or advice.

Upvotes: 2

Views: 194

Answers (1)

Dillon
Dillon

Reputation: 999

I believe you want to be able to read the rows as sentences

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(
    obj=[['Boeing'], ['aircrafts', 'they', 'agreement', 'airplane'], ['exception', 'these']],
    rel1=[['sells'], ['are', 'built',  'made',  'built'],['are', 'are']],
    dep1=[['aircrafts'], [], []],
    rel2=[['to'], ['on', 'with'], ['of', 'of', 'for']]
))

# Output dataframe
out = pd.DataFrame()

# Keep track of which set of rows we have already appended to the dataframe
row_counter = 0

# Loop through each row in the input dataframe
for row, value in df.iterrows():

    # Get the max len of each list in this row
    rows_needed = value.map(lambda x: len(x)).max()

    for i in range(rows_needed):

        # Set a name for this row (numeric)
        new_row = pd.Series(name=row_counter+i)

        # Loop through each header and create a single row per item in the list
        for header in value.index:

            # Find if there will be a value here or null
            this_value = np.nan
            if i < len(df.loc[row, header]):
                this_value = df.loc[row, header][i]

            # Add this single result to a series for this row
            new_row = new_row.set_value(header, this_value)

        # Add this row series to the full dataframe
        out = out.append(new_row)
    row_counter += rows_needed

out
Out[1]:
        dep1        obj   rel1  rel2
0  aircrafts     Boeing  sells    to
1        NaN  aircrafts    are    on
2        NaN       they  built  with
3        NaN  agreement   made   NaN
4        NaN   airplane  built   NaN
5        NaN  exception    are    of
6        NaN      these    are    of
7        NaN        NaN    NaN   for

order = ['obj', 'rel1', 'dep1', 'rel2']
out = out[order]
out
Out[2]:
         obj   rel1       dep1  rel2
0     Boeing  sells  aircrafts    to
1  aircrafts    are        NaN    on
2       they  built        NaN  with
3  agreement   made        NaN   NaN
4   airplane  built        NaN   NaN
5  exception    are        NaN    of
6      these    are        NaN    of
7        NaN    NaN        NaN   for

Upvotes: 1

Related Questions