Reputation: 2122
I have a dataframe that looks like this:
from to cc extra columns
-------------------------------------------
1 2 3 sth
1|1 4 sth
3 1|2 4|5 sth
I want a new dataframe that creates a new row for every pipe separated value like this:
from to cc extra columns
--------------------------------------------
1 2 3 sth
1 4 sth
1 4 sth
3 1 4 sth
3 2 4 sth
3 1 5 sth
3 2 5 sth
Can someone please help me with this?
Thank you!
Upvotes: 2
Views: 2631
Reputation: 294488
numpy
split
column by column.concatenate
target column while repeat
on the rest of the arraydef explode(v, i, sep='|'):
v = v.astype(str)
n, m = v.shape
a = v[:, i]
bslc = np.r_[0:i, i+1:m]
asrt = np.append(i, bslc).argsort()
b = v[:, bslc]
a = np.core.defchararray.split(a, sep)
A = np.concatenate(a)[:, None]
counts = [len(x) for x in a.tolist()]
rpt = np.arange(n).repeat(counts)
return np.concatenate([A, b[rpt]], axis=1)[:, asrt]
pd.DataFrame(
explode(explode(explode(df.values, 0), 1), 2),
columns=df.columns
)
from to cc extra_columns
0 1 2 3 sth
1 1 4 sth
2 1 4 sth
3 3 1 4 sth
4 3 1 5 sth
5 3 2 4 sth
6 3 2 5 sth
time test
given data
10 X faster
given data repeated 1000 times
100 X faster
Upvotes: 1
Reputation: 36691
The easiest way to do this is to move outside of pandas to do the list manipulation, then port the new data back into a pandas dataframe. Here we use product
from itertools to create all of the combinations after splitting each row into a bunch of lists.
Note that I use letters instead of number for my data.
import pandas as pd
from itertools import product
df = pd.DataFrame([['a', 'b', 'c', 'xtra'],
['a|a','d', '', 'xtra'],
['c', 'a|b','d|e','xtra']],
columns=['From','To','CC','extra_cols'])
split_data = []
for row in df.values.tolist():
split_data.extend(list(product(*[item.split('|') for item in row])))
new_df = pd.DataFrame(split_data, columns=['From','To','CC','extra_cols'])
> new_df
# From To CC extra_cols
# 0 a b c xtra
# 1 a d xtra
# 2 a d xtra
# 3 c a d xtra
# 4 c a e xtra
# 5 c b d xtra
# 6 c b e xtra
Upvotes: 0
Reputation: 1846
A not elegant but working solution:
import pandas as pd
import itertools
df = pd.read_csv('path/to/file.csv', index_col=None)
all_rows = []
for _, r in df.iterrows():
froms = str(r['from']).split('|')
to = str(r['to']).split('|')
ccs = str(r['cc']).split('|')
rows = [[f, t, cc] + list(r[df.columns[3:]]) for f, t, cc in itertools.product(froms, to, ccs)]
all_rows += rows
df_2 = pd.DataFrame(data=all_rows, columns=df.columns)
Upvotes: 2