Gingerbread
Gingerbread

Reputation: 2122

Pandas dataframe : Pipe separated values in a cell

I have a dataframe that looks like this:

from      to         cc      extra columns
-------------------------------------------
1          2          3           sth
1|1        4                      sth
3          1|2       4|5          sth

I want a new dataframe that creates a new row for every pipe separated value like this:

from       to        cc       extra columns
--------------------------------------------
1           2         3          sth
1           4                    sth
1           4                    sth
3           1         4          sth
3           2         4          sth
3           1         5          sth
3           2         5          sth

Can someone please help me with this?

Thank you!

Upvotes: 2

Views: 2631

Answers (3)

piRSquared
piRSquared

Reputation: 294488

numpy

  • split column by column.
  • concatenate target column while repeat on the rest of the array

def explode(v, i, sep='|'):
    v = v.astype(str)
    n, m = v.shape
    a = v[:, i]
    bslc = np.r_[0:i, i+1:m]
    asrt = np.append(i, bslc).argsort()
    b = v[:, bslc]
    a = np.core.defchararray.split(a, sep)
    A = np.concatenate(a)[:, None]
    counts = [len(x) for x in a.tolist()]
    rpt = np.arange(n).repeat(counts)
    return np.concatenate([A, b[rpt]], axis=1)[:, asrt]

pd.DataFrame(
    explode(explode(explode(df.values, 0), 1), 2),
    columns=df.columns
)

  from to cc extra_columns
0    1  2  3           sth
1    1  4              sth
2    1  4              sth
3    3  1  4           sth
4    3  1  5           sth
5    3  2  4           sth
6    3  2  5           sth

time test
given data
10 X faster
enter image description here

given data repeated 1000 times
100 X faster
enter image description here

Upvotes: 1

James
James

Reputation: 36691

The easiest way to do this is to move outside of pandas to do the list manipulation, then port the new data back into a pandas dataframe. Here we use product from itertools to create all of the combinations after splitting each row into a bunch of lists.

Note that I use letters instead of number for my data.

import pandas as pd
from itertools import product

df = pd.DataFrame([['a',  'b', 'c', 'xtra'],
                   ['a|a','d',  '', 'xtra'],
                   ['c',  'a|b','d|e','xtra']],
                   columns=['From','To','CC','extra_cols'])

split_data = []
for row in df.values.tolist():
    split_data.extend(list(product(*[item.split('|') for item in row])))

new_df = pd.DataFrame(split_data, columns=['From','To','CC','extra_cols'])

> new_df
#   From To CC extra_cols
# 0    a  b  c       xtra
# 1    a  d          xtra
# 2    a  d          xtra
# 3    c  a  d       xtra
# 4    c  a  e       xtra
# 5    c  b  d       xtra
# 6    c  b  e       xtra

Upvotes: 0

scomes
scomes

Reputation: 1846

A not elegant but working solution:

import pandas as pd
import itertools

df = pd.read_csv('path/to/file.csv', index_col=None)

all_rows = []
for _, r in df.iterrows():

    froms = str(r['from']).split('|')
    to = str(r['to']).split('|')
    ccs = str(r['cc']).split('|')

    rows = [[f, t, cc] + list(r[df.columns[3:]]) for f, t, cc in itertools.product(froms, to, ccs)]
    all_rows += rows

df_2 = pd.DataFrame(data=all_rows, columns=df.columns)

Upvotes: 2

Related Questions