Reputation: 193

replicate rows in pandas by specific column with the values from that column

What would be the most efficient way to solve this problem?

i_have = pd.DataFrame(data={
  'id': ['A', 'B', 'C'],
  'v' : [ 's,m,l',  '1,2,3',   'k,g']
})

i_need = pd.DataFrame(data={
  'id': ['A','A','A','B','B','B','C', 'C'],
  'v' : ['s','m','l','1','2','3','k','g']
})

I though about creating a new df and while iterating over i_have append the records to the new df. But as number of rows grow, it can take a while.

Upvotes: 1

Answers (3)

Zero

Reputation: 77027

Here's another way

In [1667]: (i_have.set_index('id').v.str.split(',').apply(pd.Series)
                  .stack().reset_index(name='v').drop('level_1', 1))
Out[1667]:
  id  v
0  A  s
1  A  m
2  A  l
3  B  1
4  B  2
5  B  3
6  C  k
7  C  g

As pointed in comment.

In [1672]: (i_have.set_index('id').v.str.split(',', expand=True)
                  .stack().reset_index(name='v').drop('level_1', 1))
Out[1672]:
  id  V
0  A  s
1  A  m
2  A  l
3  B  1
4  B  2
5  B  3
6  C  k
7  C  g

Upvotes: 1

Bharath M Shetty

Reputation: 30605

If you have multiple columns then first split the data by , with expand = True(Thank you piRSquared) then stack and ffill i.e

i_have = pd.DataFrame(data={
  'id': ['A', 'B', 'C'],
  'v' : [ 's,m,l',  '1,2,3',   'k,g'],
  'w' : [ 's,8,l',  '1,2,3',   'k,g'],
  'x' : [ 's,0,l',  '1,21,3',   'ks,g'],
  'y' : [ 's,m,l',  '11,2,3',   'ks,g'],  
  'z' : [ 's,4,l',  '1,2,32',   'k,gs'],
})

i_want = i_have.apply(lambda x :x.str.split(',',expand=True).stack()).reset_index(level=1,drop=True).ffill()

If the values are not equal sized then

i_want = i_have.apply(lambda x :x.str.split(',',expand=True).stack()).reset_index(level=1,drop=True)
i_want['id'] = i_want['id'].ffill()

Output i_want

  id  v  w   x   y   z
0  A  s  s   s   s   s
1  A  m  8   0   m   4
2  A  l  l   l   l   l
3  B  1  1   1  11   1
4  B  2  2  21   2   2
5  B  3  3   3   3  32
6  C  k  k  ks  ks   k
7  C  g  g   g   g  gs

Upvotes: 1

jezrael

Reputation: 863741

Use numpy.repeat with numpy.concatenate for flattening:

#create lists by split
splitted = i_have['v'].str.split(',')
#get legths of each lists 
lens = splitted.str.len()

df = pd.DataFrame({'id':np.repeat(i_have['id'], lens),
                    'v':np.concatenate(splitted)})
print (df)
  id  v
0  A  s
0  A  m
0  A  l
1  B  1
1  B  2
1  B  3
2  C  k
2  C  g

Thank you piRSquared for solution for repeat multiple columns:

i_have = pd.DataFrame(data={
  'id': ['A', 'B', 'C'],
  'id1': ['A1', 'B1', 'C1'],
  'v' : [ 's,m,l',  '1,2,3',   'k,g']
})
print (i_have)
  id id1      v
0  A  A1  s,m,l
1  B  B1  1,2,3
2  C  C1    k,g

splitted = i_have['v'].str.split(',')
lens = splitted.str.len()

df = i_have.loc[i_have.index.repeat(lens)].assign(v=np.concatenate(splitted))
print (df)
  id id1  v
0  A  A1  s
0  A  A1  m
0  A  A1  l
1  B  B1  1
1  B  B1  2
1  B  B1  3
2  C  C1  k
2  C  C1  g

Upvotes: 3

replicate rows in pandas by specific column with the values from that column

Answers (3)

Related Questions