irene
irene

Reputation: 2243

split several columns with tuples into separate columns

So I know there are a lot of posts on how to split a single column of tuples into several columns. But I haven't seen any on how to split several columns of tuples at the same time. I've tried doing a nested apply:

spl.apply(lambda col: spl[col].apply(pd.Series)) 

but it doesn't work.

spl = pd.DataFrame({('a', 'b') : [(i, i+1) for i in range(3)], 
                    ('a', 'c'): [(2*i, 2*i+1) for i in range(3)], 
                    ('e', 'b'): [(2*i, 2*i+1) for i in range(3)]})
>>> spl
                a        e
         b      c        b
0   (0, 1)  (0, 1)  (0, 1)
1   (1, 2)  (2, 3)  (2, 3)
2   (2, 3)  (4, 5)  (4, 5)

The target is:

                a          e
    b0  b1  c0  c1   b0   b1
0   0   1   0    1    0    1
1   1   2   2    3    2    3
2   2   3   4    5    4    5

Upvotes: 1

Views: 152

Answers (3)

sammywemmy
sammywemmy

Reputation: 28644

One option is to dump it into vanilla python, take advantage of python's structures before rebuilding the DataFrame; depending on your data size, it may be more efficient (and hopefully easier to understand, if you ever need to refer to it) - of course tests are your true way to verify performance (thanks to @wjandrea for the improvements):


frame = {}

for (top, bottom), value in spl.items():
    for i, arr in enumerate(zip(*value)):
        frame[(top, f"{bottom}{i}")] = arr

pd.DataFrame(frame)

   a           e   
  b0 b1 c0 c1 b0 b1
0  0  1  0  1  0  1
1  1  2  2  3  2  3
2  2  3  4  5  4  5

Upvotes: 2

user7864386
user7864386

Reputation:

We could use unstack + DataFrame constructor + pivot. We can't pivot a Series object, so we reset_index first.

To fix duplicate column names, we could reformat column names by creating a MultiIndex object by combining original column name level 1 with the tuple index:

tmp = spl.unstack()
out = (pd.pivot(pd.DataFrame(tmp.tolist(), index=tmp.index).reset_index(), 
                ['level_2'], ['level_0','level_1'], [0,1])
       .rename_axis(columns=[None, None, None], index=[None]))
out.columns = pd.MultiIndex.from_arrays((out.columns.get_level_values(1), 
                                         out.columns.get_level_values(2) + 
                                         out.columns.get_level_values(0).astype(str)))
out = out.sort_index(axis=1)

Output:

   a           e   
  b0 b1 c0 c1 b0 b1
0  0  1  0  1  0  1
1  1  2  2  3  2  3
2  2  3  4  5  4  5

As @sammywemmy suggested in the comments to this question, rebuilding the DataFrame from vanilla Python is indeed easier than the mess above.

from itertools import chain, product
out = pd.DataFrame([list(chain.from_iterable(lst)) for lst in spl.to_numpy().tolist()], 
                   columns=pd.MultiIndex.from_tuples((i, j+str(k)) for (i,j), k in 
                                                     product(spl.columns, [0,1])))

Upvotes: 2

user17242583
user17242583

Reputation:

Here's a solution:

new_df = pd.concat([pd.DataFrame(spl[c].tolist()).add_prefix(c[-1]) for c in spl], axis=1)
new_df.columns = pd.MultiIndex.from_arrays([np.repeat(spl.columns.get_level_values(0), 2), new_df.columns])

Output:

>>> new_df
   a           e   
  b0 b1 c0 c1 b0 b1
0  0  1  0  1  0  1
1  1  2  2  3  2  3
2  2  3  4  5  4  5

One-big-liner :)

new_df = pd.concat([pd.DataFrame(spl[c].tolist()).add_prefix(c[-1]) for c in spl], axis=1).pipe(lambda x: x.set_axis(pd.MultiIndex.from_arrays([np.repeat(spl.columns.get_level_values(0), 2), x.columns]), axis=1))

Upvotes: 2

Related Questions