This code:
import numpy as np
import pandas as pd
df = pd.DataFrame(['a1', 'a2', 'stop', 'a4', 'a4', 'a5', 'stop', 'a3'],
                  columns=['c'])
renders:
      c
0    a1
1    a2
2  stop
3    a4
4    a4
5    a5
6  stop
7    a3
I'm attempting to produce the following dataframe, where the values in a column are concatenated until a 'stop' value is encountered:
columns = ['c1' , 'c2']
data = np.array([['a1, a2','stop'] , ['a4, a4, a5','stop']])
df = pd.DataFrame(data, columns=columns)
df
           c1    c2
0      a1, a2  stop
1  a4, a4, a5  stop
Is this a valid approach: filter the rows where the column value is 'stop',
df[df['c'] == 'stop']
and then access the rows before each match?
Upvotes: 3
Views: 611
First, create a boolean mask by testing the equality of c to "stop":
>>> df = pd.DataFrame(['a1', 'a2', 'stop', 'a3', 'a4', 'a5', 'stop', 'a6'],
...                   columns=['c'])
>>> mask = df['c'].eq('stop')
You also specified you want to ignore values after the final stop. Truncate both series with:
>>> stop = mask[::-1].idxmax()
>>> mask = mask[:stop]
>>> c = df['c'][:stop].copy()
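As a quick check of the truncation step, here is a minimal sketch run on the question's original data, where 'a3' trails the last stop. One assumption worth stating: idxmax returns an index label, while [:stop] slices by position, so this relies on the default 0..n-1 index, where the two coincide.

```python
import pandas as pd

df = pd.DataFrame(['a1', 'a2', 'stop', 'a4', 'a4', 'a5', 'stop', 'a3'],
                  columns=['c'])
mask = df['c'].eq('stop')

# Reversing puts the last 'stop' first; idxmax returns the label of the first True.
stop = mask[::-1].idxmax()
print(stop)  # 6

# [:stop] is a positional slice, so this assumes the default RangeIndex,
# where labels and positions are the same.
c = df['c'][:stop].copy()
print(c.tolist())  # ['a1', 'a2', 'stop', 'a4', 'a4', 'a5']
```

The final stop and everything after it are dropped. Earlier stop rows remain and are filtered out later, inside the groupby.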
Now groupby:
>>> c.groupby(mask.cumsum()).apply(lambda s: s[s!='stop'].tolist())
c
0        [a1, a2]
1    [a3, a4, a5]
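With a cumulative sum, True maps to 1 and False maps to 0, so the running total increases by one at each stop. This serves as the grouping key. Each stop row lands in the group it opens, which is why the lambda filters it out.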
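To make the grouping keys visible, here is a small sketch on the already truncated series from the steps above. The name group_id is mine, used only for illustration.

```python
import pandas as pd

# Rows up to (but not including) the final 'stop', as produced by the truncation step.
c = pd.Series(['a1', 'a2', 'stop', 'a3', 'a4', 'a5'], name='c')
mask = c.eq('stop')

# True counts as 1 and False as 0, so the running total steps up at each 'stop'.
group_id = mask.cumsum()
print(group_id.tolist())  # [0, 0, 1, 1, 1, 1]
```

Rows 0 and 1 share key 0, and rows 2 through 5 share key 1, with the stop row at position 2 included in the second group.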
A footnote: this logic should work whether or not the final value in the Series is a stop.
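If you want the exact two-column frame from the question rather than lists, the same steps can be wrapped up roughly as follows. This is only a sketch: the function name concat_until_stop and its parameters are hypothetical, and it assumes a default integer index and at least one stop value in the column.

```python
import pandas as pd

def concat_until_stop(df, col='c', stop_value='stop', sep=', '):
    """Hypothetical helper: join the values between stop markers into strings."""
    mask = df[col].eq(stop_value)
    # Label of the last stop; rows from there onward are ignored.
    stop = mask[::-1].idxmax()
    mask = mask[:stop]
    c = df[col][:stop]
    # Group by the running count of stops and join everything that is not a stop.
    joined = c.groupby(mask.cumsum()).apply(lambda s: sep.join(s[s != stop_value]))
    return pd.DataFrame({'c1': joined.to_numpy(), 'c2': stop_value})

df = pd.DataFrame(['a1', 'a2', 'stop', 'a4', 'a4', 'a5', 'stop', 'a3'],
                  columns=['c'])
out = concat_until_stop(df)
print(out)

# Same logic when the data ends in a stop.
ends_in_stop = pd.DataFrame(['a1', 'a2', 'stop', 'a4', 'stop'], columns=['c'])
out_trailing = concat_until_stop(ends_in_stop)
print(out_trailing)
```

On the question's data this gives the rows 'a1, a2' and 'a4, a4, a5', each paired with 'stop' in c2.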
Upvotes: 5