blue-sky

Reputation: 53806

Concat column values based on condition

This code:

import numpy as np
import pandas as pd

df = pd.DataFrame(['a1', 'a2', 'stop', 'a4', 'a4', 'a5', 'stop', 'a3'],
                  columns=['c'])

renders:

      c
0    a1
1    a2
2  stop
3    a4
4    a4
5    a5
6  stop
7    a3

I'm attempting to produce the following dataframe, where the values in the column are concatenated until a 'stop' value is encountered:

columns = ['c1' , 'c2']
data = np.array([['a1, a2','stop'] , ['a4, a4, a5','stop']])
df = pd.DataFrame(data, columns=columns)
df

           c1    c2
0      a1, a2  stop
1  a4, a4, a5  stop

Is it a valid approach to filter the rows where the column value is 'stop':

df[df['c'] == 'stop']

and then access the previous rows?

Upvotes: 3

Views: 611

Answers (1)

Brad Solomon

Reputation: 40878

First, create a boolean mask by testing the equality of c to "stop":

>>> df = pd.DataFrame(['a1', 'a2', 'stop', 'a3', 'a4', 'a5', 'stop', 'a6'],
...                   columns=['c'])
>>> mask = df['c'].eq('stop')

Your expected output also drops the values after the final stop. Truncate both Series at that point (this assumes the default 0..n-1 index, where labels and positions coincide):

>>> stop = mask[::-1].idxmax()
>>> mask = mask[:stop]
>>> c = df['c'][:stop].copy()
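
To make the truncation step easier to follow, here is a minimal standalone sketch of it. The names `s`, `is_stop`, `last_stop` and `kept` are illustrative, and `.iloc` is used in place of plain slicing only to make the positional intent explicit; it still assumes the default integer index.

```python
import pandas as pd

s = pd.Series(['a1', 'a2', 'stop', 'a3', 'a4', 'a5', 'stop', 'a6'], name='c')
is_stop = s.eq('stop')

# idxmax returns the label of the FIRST maximum (the first True), so
# reversing the mask first yields the label of the LAST stop instead.
last_stop = is_stop[::-1].idxmax()

# With the default index, that label is also the position of the last stop.
# Slicing up to it drops the final stop and everything after it.
kept = s.iloc[:last_stop]

print(last_stop)      # 6
print(kept.tolist())  # ['a1', 'a2', 'stop', 'a3', 'a4', 'a5']
```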

Now group by the cumulative sum of the mask:

>>> c.groupby(mask.cumsum()).apply(lambda s: s[s!='stop'].tolist())
c
0        [a1, a2]
1    [a3, a4, a5]

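In a cumulative sum, True counts as 1 and False as 0, so the running total increases by one at every stop. All rows between two stops therefore share the same label, and that label serves as the grouping key. A stop row itself falls into the group that follows it, which is why the lambda filters 'stop' out.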
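
To see the group labels concretely, this small sketch prints the cumulative sum for the truncated values. The names `c` and `group_id` are illustrative.

```python
import pandas as pd

# The values that remain after truncating at the final stop
c = pd.Series(['a1', 'a2', 'stop', 'a3', 'a4', 'a5'])

# True adds 1 to the running total, False adds 0
group_id = c.eq('stop').cumsum()

print(group_id.tolist())  # [0, 0, 1, 1, 1, 1]
```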

A footnote: this logic should work whether or not the Series ends in a stop.
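
If you want the two-column frame from the question rather than lists, one possible variation is sketched below. The helper `concat_until_stop` and its parameters are hypothetical names, not part of pandas, and it differs slightly from the steps above: it keeps each stop row in the group it closes by subtracting the mask from the cumulative sum, and it locates the last stop by position so it does not depend on the index labels.

```python
import pandas as pd


def concat_until_stop(values, stop='stop', sep=', '):
    """Join the values that precede each stop marker (hypothetical helper)."""
    s = pd.Series(list(values), dtype=object)
    is_stop = s.eq(stop)
    if not is_stop.any():
        # No stop marker at all: nothing to report
        return pd.DataFrame({'c1': [], 'c2': []})

    # Position of the last stop; keep everything up to and including it
    last = is_stop.to_numpy().nonzero()[0][-1]
    s = s.iloc[:last + 1]
    is_stop = is_stop.iloc[:last + 1]

    # cumsum increments ON each stop row; subtracting the mask moves the
    # stop row back into the group it terminates
    group_id = is_stop.cumsum() - is_stop.astype(int)

    # Join the non-stop values of each group into one string
    joined = s.groupby(group_id).agg(lambda g: sep.join(g[g != stop]))
    return pd.DataFrame({'c1': joined.tolist(), 'c2': [stop] * len(joined)})


with_tail = concat_until_stop(['a1', 'a2', 'stop', 'a4', 'a4', 'a5', 'stop', 'a3'])
without_tail = concat_until_stop(['a1', 'a2', 'stop', 'a4', 'a4', 'a5', 'stop'])
print(with_tail)
```

Both calls give the same two rows, since the trailing 'a3' comes after the final stop and is dropped. Two consecutive stops would produce a row with an empty string in c1; whether that is desirable depends on your data.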

Upvotes: 5
