Reputation: 651
I would like to expand/repeat # of rows based on the length of sequences. For example:
import pandas as pd
df = pd.DataFrame({"id":[11,12], "seq":['A,B,C', 'A,B,C,D'], "y":[1, 5]})
print(df)
id seq y
0 11 A,B,C 1
1 12 A,B,C,D 5
I would this df to be
id seq y
0 11 A 1
1 11 A,B 1
2 11 A,B,C 1
3 12 A 5
4 12 A,B 5
5 12 A,B,C 5
6 12 A,B,C,D 5
Any suggestions? Thanks
Upvotes: 3
Views: 364
Reputation: 261280
Assuming a unique index (or a unique ID in a given column), and since you need to loop anyway, you could avoid using explode
by directly creating the correct output column and expanding the original dataframe with join
(or merge
if using a column as ID):
s = pd.Series(*zip(*((','.join(l[:i+1]), idx)
for idx, l in zip(df.index, df['seq'].str.split(','))
for i in range(len(l)))),
name='seq')
out = df.drop(columns='seq').join(s)[df. columns]
Alternatively, a step by step method to construct s
:
# split strings
tmp = df['seq'].str.split(',')
# construct the incremental strings
vals = [','.join(l[:i+1])
for l in tmp
for i in range(len(l))]
# convert to Series
s = pd.Series(vals, index=df.index.repeat(tmp.str.len()),
name='seq')
# join
out = df.drop(columns='seq').join(s)[df. columns]
Output:
id seq y
0 11 A 1
0 11 A,B 1
0 11 A,B,C 1
1 12 A 5
1 12 A,B 5
1 12 A,B,C 5
1 12 A,B,C,D 5
Intermediate s
:
0 A
0 A,B
0 A,B,C
1 A
1 A,B
1 A,B,C
1 A,B,C,D
Name: seq, dtype: object
Upvotes: 2
Reputation: 35676
Another option would be to use groupby expanding
after explode
:
df['seq'] = df['seq'].str.split(',')
df = df.explode('seq')
df['seq'] = [','.join(ewv) for ewv in df.groupby(level=0).expanding()['seq']]
df = df.reset_index(drop=True)
df
:
id seq y
0 11 A 1
1 11 A,B 1
2 11 A,B,C 1
3 12 A 5
4 12 A,B 5
5 12 A,B,C 5
6 12 A,B,C,D 5
Upvotes: 2
Reputation: 5331
First, splitting and exploding to get new rows:
df['seq1'] = df['seq'].str.split(',')
df = df.explode('seq1')
Then in groups, for each sequence, do a cumulative appending of the values. Then split and join them while stripping the leading and trailing commas that appear due to that split.
df['l'] = df.groupby('id')['seq1'].apply(
lambda s: s.cumsum().str.split('').str.join(',').str.strip(','))
Thus,
>>> df
id seq y seq1 l
0 11 A,B,C 1 A A
0 11 A,B,C 1 B A,B
0 11 A,B,C 1 C A,B,C
1 12 A,B,C,D 5 A A
1 12 A,B,C,D 5 B A,B
1 12 A,B,C,D 5 C A,B,C
1 12 A,B,C,D 5 D A,B,C,D
Upvotes: 1
Reputation: 10545
You could collect the odd-length prefixes of each seq
string in a list comprehension and then explode the dataframe on those lists:
df['seq'] = df.seq.apply(lambda s: [s[:i] for i in range(1, len(s)+1, 2)])
df = df.explode('seq')
df.reset_index(drop=True)
id seq y
0 11 A 1
1 11 A,B 1
2 11 A,B,C 1
3 12 A 5
4 12 A,B 5
5 12 A,B,C 5
6 12 A,B,C,D 5
Upvotes: 1