Reputation: 131
I have a dataframe with 5 columns, each column containing lists of variable lengths. This is what a row in my dataframe looks like:
A B
1 [aircrafts, they, agreement, airplane] [are, built, made, built]
Now I would like to 'unpack' or 'unstack' these lists so that each cell only contains one value (one word). In the unpacking process, the words in cells from one column should be combined pairwise with the corresponding value in the next column. The result would then be:
A B
1 aircrafts are
2 they built
3 agreement made
4 airplane built
For reference, my full dataframe looks as follows:
obj rel1 \
0 [Boeing] [sells]
1 [aircrafts, they, agreement, airplane] [are, built, made, built]
2 [exception, these] [are, are]
3 [sales, contract] [regulated, consist]
4 [contract] [stipulates]
5 [acquisition] [has]
6 [contract] [managed]
7 [employee] [act]
8 [salesperson, Boeing] [change, ensures]
9 [airlines, airlines] [related, have]
10 [Boeing] [keep]
dep1 rel2 \
0 [aircrafts] [to]
1 [] [on, with]
2 [] [of, of, for]
3 [] [by, of, with, of, of]
4 [elements] [across, as]
5 [details] [of, as, of, for]
6 [] [by]
7 [] [as, for]
8 [] [Given, of, over, for]
9 [company] [to, for]
10 [track, aircrafts] [of, between, to, of]
dep2
0 [companies]
1 [demand, customer]
2 [airplanes, scope, case]
3 [means, contracts, companies, acquisitions, ai...
4 [acquisitions, conditions]
5 [airplane, model, airplane, options]
6 [salesperson]
7 [salesperson, contracts]
8 [term, contracts, time, client]
9 [other, example]
10 [relationships, companies, airlines, buyer]
How can I perform the 'unpacking' and rearranging operations in python? It would be great if these operations could be performed on the dataframe itself. If this proves to be difficult or impossible, is there a way I could rearrange the data in the lists before combining them into a dataframe?
Thank you very much for any help or advice.
Upvotes: 2
Views: 194
Reputation: 999
I believe you want to be able to read the rows as sentences
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(
obj=[['Boeing'], ['aircrafts', 'they', 'agreement', 'airplane'], ['exception', 'these']],
rel1=[['sells'], ['are', 'built', 'made', 'built'],['are', 'are']],
dep1=[['aircrafts'], [], []],
rel2=[['to'], ['on', 'with'], ['of', 'of', 'for']]
))
# Output dataframe
out = pd.DataFrame()
# Keep track of which set of rows we have already appended to the dataframe
row_counter = 0
# Loop through each row in the input dataframe
for row, value in df.iterrows():
# Get the max len of each list in this row
rows_needed = value.map(lambda x: len(x)).max()
for i in range(rows_needed):
# Set a name for this row (numeric)
new_row = pd.Series(name=row_counter+i)
# Loop through each header and create a single row per item in the list
for header in value.index:
# Find if there will be a value here or null
this_value = np.nan
if i < len(df.loc[row, header]):
this_value = df.loc[row, header][i]
# Add this single result to a series for this row
new_row = new_row.set_value(header, this_value)
# Add this row series to the full dataframe
out = out.append(new_row)
row_counter += rows_needed
out
Out[1]:
dep1 obj rel1 rel2
0 aircrafts Boeing sells to
1 NaN aircrafts are on
2 NaN they built with
3 NaN agreement made NaN
4 NaN airplane built NaN
5 NaN exception are of
6 NaN these are of
7 NaN NaN NaN for
order = ['obj', 'rel1', 'dep1', 'rel2']
out = out[order]
out
Out[2]:
obj rel1 dep1 rel2
0 Boeing sells aircrafts to
1 aircrafts are NaN on
2 they built NaN with
3 agreement made NaN NaN
4 airplane built NaN NaN
5 exception are NaN of
6 these are NaN of
7 NaN NaN NaN for
Upvotes: 1