Reputation: 85
I am trying to build a list of tuples holding the start and end rows of each group in a dataframe df2, where the groups are determined by the index (the first, or zeroth, column df2[0]). df2 example:
COL0 COL1 COL2
4 x y # start 'tuple x' of COL1
5 i j
6 n m # end 'tuple n'
14 f a # start 'tuple f'
15 e b # end 'tuple e'
...
So consecutive COL0 values form a group. If the next row is not consecutive (e.g. 6 followed by 14), a new group starts. A selection criterion could be the following:
Crit_a = df2[0][0] + 1 == df2[0][1]
As output, I am looking for a new df3 with one row per group, as follows:
COL0 COL1 COL2 COL3 COL4 ...
4 x y n m # start values and end values of COL1 and COL2
14 f a e b
I was looking at SO here and other locations. Thank you for your suggestions.
Upvotes: 1
Views: 406
Reputation: 109626
Not exactly your desired output, but perhaps more intuitive?
I create a column named group_no to label the consecutive values from COL0. I differenced the column, located the values where this difference was not one, and then did a cumsum on the result. The first element is ambiguous (it is NaN when differenced), so I check whether its value plus one equals the second value. If so, the first value is continuous and assigned a value of 1. If not, it is not continuous and assigned a value of 0.
df = df.assign(group_no = (df.COL0.diff() != 1).cumsum())
df.loc[df.index[0], 'group_no'] = 1 if df.COL0.iat[0] + 1 == df.COL0.iat[1] else 0
df_new = df.groupby('group_no').agg(
{'COL0': ['first'],
'COL1': ['first', 'last'],
'COL2': ['first', 'last']})
>>> df_new
           COL2       COL0  COL1
          first last first first last
group_no
1             y    m     4     x    n
2             a    b    14     f    e
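To make the labeling step easier to follow, here is a minimal, self-contained sketch that rebuilds the question's sample data and prints the intermediate group labels. The frame construction and the variable names starts_new_group and group_no are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data from the question (COL0 holds the row numbers).
df = pd.DataFrame({
    "COL0": [4, 5, 6, 14, 15],
    "COL1": ["x", "i", "n", "f", "e"],
    "COL2": ["y", "j", "m", "a", "b"],
})

# A row opens a new group when it is not exactly one more than the previous row.
# The first row's diff is NaN, and NaN != 1 evaluates to True.
starts_new_group = df["COL0"].diff() != 1

# The running count of group starts gives one label per run of consecutive values.
group_no = starts_new_group.cumsum()

print(group_no.tolist())  # [1, 1, 1, 2, 2]
```

On this data the first row is already labeled 1 by the cumsum, because the NaN comparison counts as a group start.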
The agg function takes a dictionary, so the resulting order of the columns can be arbitrary (on Python versions before 3.7, dictionaries do not preserve insertion order). To order the resulting columns, you could do it explicitly, e.g.:
df_new[[('COL0', 'first'),
('COL1', 'first'),
('COL1', 'last'),
('COL2', 'first'),
('COL2', 'last')]]
This may also work:
n = 3 # First three columns of original dataframe.
df_new.loc[:, pd.IndexSlice[df.columns[:n], :]]
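If the flat COL0 to COL4 layout from the question is preferred over the two-level header, one possible sketch uses named aggregation (available in pandas 0.25 and later). The mapping of COL3 and COL4 to the last values of COL1 and COL2 is read off the question's desired output; the name df3 is taken from the question, and the sample frame is rebuilt here as an assumption.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "COL0": [4, 5, 6, 14, 15],
    "COL1": ["x", "i", "n", "f", "e"],
    "COL2": ["y", "j", "m", "a", "b"],
})

df3 = (
    df.assign(group_no=(df["COL0"].diff() != 1).cumsum())
      .groupby("group_no")
      .agg(
          COL0=("COL0", "first"),  # start of the run
          COL1=("COL1", "first"),  # start values
          COL2=("COL2", "first"),
          COL3=("COL1", "last"),   # end values
          COL4=("COL2", "last"),
      )
      .reset_index(drop=True)      # drop the helper group label
)

print(df3)
```

Because each output column is named explicitly, the column order follows the keyword order and no reordering step is needed.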
Upvotes: 1
Reputation: 19957
First create a group key for rows belonging to the same group.
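# Group key: the COL0 value that starts each run of consecutive values.
# raw=True passes NumPy arrays to the lambda, so x[-1] and x[0] index by position.
df['COL0'] = \
(df.COL0.rolling(2,min_periods=1)
.apply(lambda x: x[-1] if int(x[-1]-x[0])!=1 else np.nan, raw=True)
.ffill()
)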
Then group by the key and find the start and end row and expand them as columns. Finally rename the columns.
df2 = \
(df[['COL1','COL2']].groupby(df.COL0)
.apply(lambda x: np.array([x.iloc[0],x.iloc[-1]]).flatten())
.apply(pd.Series)
# Relabel the integer columns 0..3 as COL1..COL4.
.rename(columns=lambda x: 'COL'+str(x+1))
)
df2
Out[178]:
COL1 COL2 COL3 COL4
COL0
4.0 x y n m
14.0 f a e b
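The same two steps can also be sketched without rolling.apply. This is a swapped-in alternative, not the answer's method: the key is built with where and ffill, and the start and end rows come from the groupby first and last methods. The names key, start and out are hypothetical, and the sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "COL0": [4, 5, 6, 14, 15],
    "COL1": ["x", "i", "n", "f", "e"],
    "COL2": ["y", "j", "m", "a", "b"],
})

# Keep COL0 only on rows that open a new run, then carry that value forward.
key = (
    df["COL0"].where(df["COL0"].diff() != 1)
              .ffill()
              .astype(int)
              .rename("start")
)

# First and last row of each run, placed side by side.
grouped = df.groupby(key)[["COL1", "COL2"]]
out = pd.concat([grouped.first(), grouped.last()], axis=1)
out.columns = ["COL1", "COL2", "COL3", "COL4"]
out.index.name = "COL0"

print(out)
```

Casting the key back to int keeps the COL0 index as 4 and 14 rather than 4.0 and 14.0.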
Upvotes: 1