DaCard
DaCard

Reputation: 651

How to expand a column with an accumulated sequence of its delimited values?

I would like to expand/repeat # of rows based on the length of sequences. For example:

import pandas as pd 
df = pd.DataFrame({"id":[11,12], "seq":['A,B,C', 'A,B,C,D'], "y":[1, 5]})
print(df)

   id      seq  y
0  11    A,B,C  1
1  12  A,B,C,D  5

I would this df to be

   id      seq     y
0  11      A       1
1  11      A,B     1
2  11      A,B,C   1
3  12      A       5
4  12      A,B     5
5  12      A,B,C   5
6  12      A,B,C,D 5

Any suggestions? Thanks

Upvotes: 3

Views: 364

Answers (4)

mozway
mozway

Reputation: 261280

Assuming a unique index (or a unique ID in a given column), and since you need to loop anyway, you could avoid using explode by directly creating the correct output column and expanding the original dataframe with join (or merge if using a column as ID):

s = pd.Series(*zip(*((','.join(l[:i+1]), idx)
                     for idx, l in zip(df.index, df['seq'].str.split(','))
                     for i in range(len(l)))),
             name='seq')

out = df.drop(columns='seq').join(s)[df. columns]

Alternatively, a step by step method to construct s:

# split strings
tmp = df['seq'].str.split(',')
# construct the incremental strings
vals = [','.join(l[:i+1])
        for l in tmp
        for i in range(len(l))]

# convert to Series
s = pd.Series(vals, index=df.index.repeat(tmp.str.len()),
              name='seq')

# join
out = df.drop(columns='seq').join(s)[df. columns]

Output:

   id      seq  y
0  11        A  1
0  11      A,B  1
0  11    A,B,C  1
1  12        A  5
1  12      A,B  5
1  12    A,B,C  5
1  12  A,B,C,D  5

Intermediate s:

0          A
0        A,B
0      A,B,C
1          A
1        A,B
1      A,B,C
1    A,B,C,D
Name: seq, dtype: object

Upvotes: 2

Henry Ecker
Henry Ecker

Reputation: 35676

Another option would be to use groupby expanding after explode:

df['seq'] = df['seq'].str.split(',')
df = df.explode('seq')
df['seq'] = [','.join(ewv) for ewv in df.groupby(level=0).expanding()['seq']]
df = df.reset_index(drop=True)

df:

   id      seq  y
0  11        A  1
1  11      A,B  1
2  11    A,B,C  1
3  12        A  5
4  12      A,B  5
5  12    A,B,C  5
6  12  A,B,C,D  5

Upvotes: 2

ifly6
ifly6

Reputation: 5331

First, splitting and exploding to get new rows:

df['seq1'] = df['seq'].str.split(',')
df = df.explode('seq1')

Then in groups, for each sequence, do a cumulative appending of the values. Then split and join them while stripping the leading and trailing commas that appear due to that split.

df['l'] = df.groupby('id')['seq1'].apply(
    lambda s: s.cumsum().str.split('').str.join(',').str.strip(','))

Thus,

>>> df
   id      seq  y seq1        l
0  11    A,B,C  1    A        A
0  11    A,B,C  1    B      A,B
0  11    A,B,C  1    C    A,B,C
1  12  A,B,C,D  5    A        A
1  12  A,B,C,D  5    B      A,B
1  12  A,B,C,D  5    C    A,B,C
1  12  A,B,C,D  5    D  A,B,C,D

Upvotes: 1

Arne
Arne

Reputation: 10545

You could collect the odd-length prefixes of each seq string in a list comprehension and then explode the dataframe on those lists:

df['seq'] = df.seq.apply(lambda s: [s[:i] for i in range(1, len(s)+1, 2)])
df = df.explode('seq')
df.reset_index(drop=True)
    id  seq      y
0   11  A        1
1   11  A,B      1
2   11  A,B,C    1
3   12  A        5
4   12  A,B      5
5   12  A,B,C    5
6   12  A,B,C,D  5

Upvotes: 1

Related Questions